Abstract

Many problems in sequential decision making and stochastic control have natural
multiscale structure: sub-tasks are assembled together to accomplish complex goals.
Systematically inferring and leveraging hierarchical structure, particularly beyond a single
level of abstraction, has remained a longstanding challenge. We describe a fast multiscale
procedure for repeatedly compressing, or homogenizing, Markov decision processes
(MDPs), wherein a hierarchy of sub-problems at different scales is automatically determined.
Coarsened MDPs are themselves independent, deterministic MDPs, and may be
solved using existing algorithms. The multiscale representation delivered by this procedure
decouples sub-tasks from each other and can lead to substantial improvements in convergence
rates both locally within sub-problems and globally across sub-problems, yielding
significant computational savings. A second fundamental aspect of this work is that these
multiscale decompositions yield new transfer opportunities across different problems, where
solutions of sub-tasks at different levels of the hierarchy may be amenable to transfer to
new problems. Localized transfer of policies and potential operators at arbitrary scales is
emphasized. Finally, we demonstrate compression and transfer in a collection of illustrative
domains, including examples involving discrete and continuous statespaces.

Keywords: Markov decision processes, hierarchical reinforcement learning, transfer, 
multiscale analysis. 



1. Introduction 

Identifying and leveraging hierarchical structure has been a key, longstanding challenge for
sequential decision making and planning research (Sutton et al., 1999; Dietterich, 2000;
Parr and Russell, 1998). Hierarchical structure generally suggests a decomposition of a
complex problem into smaller, simpler sub-tasks, which may be, ideally, considered
independently (Barto and Mahadevan, 2003). One or more layers of abstraction may also provide
a broad mechanism for reusing or transferring commonly occurring sub-tasks among related
problems (Barry et al., 2011; Taylor and Stone, 2009; Soni and Singh, 2006; Ferguson and
Mahadevan, 2006). These themes are restatements of the divide-and-conquer principle: it is
usually dramatically cheaper to solve a collection of small problems than a single big
problem, when the solution of each problem involves a number of computations super-linear in
the size of the problem. Two ingredients are often sought for efficient divide-and-conquer
approaches: a hierarchical subdivision of a large problem into disjoint subproblems, and a
procedure merging the solution of subproblems into the solution of a larger problem.

This paper considers the discovery and use of hierarchical structure - multiscale structure 
in particular - in the context of discrete-time Markov decision problems. Fundamentally, 
inferring multiscale decompositions, learning abstract actions, and planning across scales 
are intimately related concepts, and we couple these elements tightly within a unifying 
framework. Two main contributions are presented: 

• The first is an efficient multiscale procedure for partitioning and then repeatedly 
compressing or homogenizing Markov decision processes (MDPs). 

• The second contribution consists of a means for identifying transfer opportunities, 
representing transferrable information, and incorporating this information into new 
problems, within the context of the multiscale decomposition. 

Several possible approaches to multiscale partitioning are considered, in which statespace 
geometry, intrinsic dimension, and the reward structure play prominent roles, although a 
wide range of existing algorithms may be chosen. Regardless of how the partitioning is 
accomplished, the statespace is divided into a multiscale collection of "clusters" connected 
via a small set of "bottleneck" states. A key function of the partitioning step is to tie 
computational complexity to problem complexity. It is problem complexity, controlled for 
instance by the inherent geometry of a problem, and its amenability to being partitioned, 
that should determine the computational complexity, rather than the choice of statespace 
representation or sampling. The decomposition we suggest attempts to rectify this difficulty. 

The multiscale compression procedure is local and efficient because only one cluster 
needs to be considered at a time. The result of the compression step is a representation 
decomposing a problem into a hierarchy of distinct sub-problems at multiple scales, each 
of which may be solved efficiently and independently of the others. The homogenization 
we propose is perfectly recursive in that a compressed MDP is again another independent, 
deterministic MDP, and the statespace of the compressed MDP is a (small) subset of the 
original problem's statespace. Moreover, each coarse MDP in a multiscale hierarchy is "con- 
sistent in the mean" with the underlying fine scale problem. The compressed representation 
coarsely summarizes a problem's statespace, reward structure and Markov transition dy- 
namics, and may be computed either analytically or by Monte-Carlo simulations. Actions 
at coarser scales are typically complex, "macro" actions, and the coarsening procedure may 
be thought of as producing different levels of abstraction of the original problem. In an 
appropriate sense, optimal value functions at homogenized scales are homogenized optimal 
value functions at the finer scale. 

Given such a hierarchy of successively coarsened representations, an MDP may be solved
efficiently. We describe a family of multiscale solution algorithms which realize
computational savings in two ways: (1) localization: computation is restricted to small, decoupled
sub-problems; and (2) conditioning: sub-problems are comparatively well-conditioned due
to improved local mixing times at finer scales and fast mixing globally at coarse scales,
and obey a form of global consistency with each other through coarser scales, which are
themselves well-conditioned coarse MDPs. The key idea behind these algorithms is that
sub-problems at a given scale decouple conditional on a solution at the next coarser scale,
but must contribute constructively towards solving the overarching problem through the
coarse solution; interleaved updates to solutions at pairs of fine and coarse scales are
repeatedly applied until convergence. We present one particular algorithm in detail: a localized
variant of modified asynchronous policy iteration that can achieve, under suitable geometric
assumptions on the problem, a cost of $O(n \log n)$ per iteration, if there are $n$ states. The
algorithm is also shown to converge to a globally optimal solution from any initial condition.

Solutions to sub-problems may be transferred among related tasks, giving a systematic 
means to approach transfer learning at multiple scales in planning and reinforcement learn- 
ing domains. If a learning problem can be decomposed into a hierarchy of distinct parts 
then there is hope that both a "meta policy" governing transitions between the parts, as 
well as some of the parts themselves, may be transferred when appropriate. We propose a 
novel form of multiscale knowledge transfer between sufficiently related MDPs that is made 
possible by the multiscale framework: Transfer between two hierarchies proceeds by match- 
ing sub-problems at various scales, transferring policies, value functions and/or potential 
operators where appropriate (and where it has been determined that transfer can help), 
and finally solving for the remainder of the destination problem using the transferred infor- 
mation. In this sense knowledge of a partial or coarse solution to one problem can be used 
to quickly learn another, both in terms of computation and, where applicable, exploratory 
experience. 

The paper is organized as follows. In Section 2 we collect preliminary definitions,
and provide a brief overview of Markov decision processes with stochastic policies and
state/action dependent rewards and discount factors. Section 3 describes partitioning,
compression and multiscale solution of MDPs. Proofs and additional comments concerning
computational considerations related to this section are collected in the Appendix. In
Section 4 we introduce the multiscale transfer learning framework, and in Section 5 we provide
examples demonstrating compression and transfer in the context of three different domains
(discrete and continuous). We discuss and compare related work in Section 6, and conclude
with some additional observations, comments and open problems in Section 7.

2. Background and Preliminaries 

The following subsections provide a brief overview of Markov decision processes as well as 
some definitions and notation used in the paper. 

2.1 Markov Decision Processes 



Formally, a Markov decision process (MDP) (see e.g. (Puterman, 1994), (Bertsekas, 2007))
is a sequential decision problem defined by a tuple $(S, A, P, R, \Gamma)$ consisting of a statespace
$S$, an action (or "control") set $A$, and for $s, s' \in S$, $a \in A$, a transition probability tensor
$P(s, a, s')$, reward function $R(s, a, s')$ and collection of discount factors $\Gamma(s, a, s') \in (0, 1)$.
We will assume that $S$, $A$ are finite sets, and that $R$ is bounded. The definition above is
slightly more general than usual in that we allow state and action dependent rewards and
discount factors; the reason for adopting this convention will be made clear in Section 3.2.
The probability $P(s, a, s')$ refers to the probability that we transition to $s'$ upon taking
action $a$ in $s$, while $R(s, a, s')$ is the reward collected in the event we transition from $s$ to $s'$
after taking action $a$ in $s$.

2.1.1 Stochastic Policies 

Let $\mathcal{P}(A)$ denote the set of all discrete probability distributions on $A$. A stationary
stochastic policy (simply a policy, from now on) $\pi : S \to \mathcal{P}(A)$ is a function mapping states into
distributions over the actions. Working with this more general class of policies will allow for
convex combinations of policies later on. A policy $\pi$ may be thought of as a non-negative
function on $S \times A$ satisfying $\sum_{a \in A} \pi(s, a) = 1$ for each $s \in S$, where $\pi(s, a)$ denotes the
probability that we take action $a$ in state $s$. We will often write $\pi(s)$ when referring to
the distribution on actions associated to the (deterministic) state $s \in S$, so that $a \sim \pi(s)$
denotes the $A$-valued random variable $a$ having law $\pi(s)$. Deterministic policies can be
recovered by placing unit masses on the desired action.$^1$

We may compute policy-specific Markov transition matrices and reward functions by
averaging out the actions according to $\pi$:

$$P^\pi(s, s') = \mathbb{E}_{a \sim \pi(s)}\left[ P(s, a, s') \right] = \sum_{a \in A} P(s, a, s') \pi(s, a) \tag{1a}$$

$$R^\pi(s, s') = \mathbb{E}_{a \sim \pi(s)}\left[ R(s, a, s') \right] = \sum_{a \in A} R(s, a, s') \pi(s, a). \tag{1b}$$

For any pair of tensors $X = X(s, a, s')$, $Y = Y(s, a, s')$ indexed by $s, s' \in S$, $a \in A$, we define
the matrix $(X \circ Y)^\pi$ to be the expectation with respect to $\pi$ of the elementwise (Hadamard)
product between $X$ and $Y$:

$$\left[ (X \circ Y)^\pi \right]_{s, s'} := \mathbb{E}_{a \sim \pi(s)}\left[ X(s, a, s') Y(s, a, s') \right] = \sum_{a \in A} X(s, a, s') Y(s, a, s') \pi(s, a). \tag{2}$$

Note that $(X \circ Y)^\pi = (Y \circ X)^\pi$.

Finally, we will often make use of the uniform random or diffusion policy, denoted $\pi^u$,
which always takes an action drawn randomly according to the uniform distribution on the
feasible actions. In the case of continuous action spaces, we assume a natural choice of
"uniform" measure has been made: for example the Haar measure if $A$ is a group, or the
volume measure if $A$ is a Riemannian manifold.
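
The averaging operations in Equations (1a), (1b) and (2) can be illustrated with a minimal sketch. The nested-list layout `T[s][a][s2]`, the function names and the two-state toy problem are assumptions made purely for illustration; they are not part of the framework itself.

```python
def average_over_policy(T, pi):
    """Average a tensor T[s][a][s2] over actions: M[s][s2] = sum_a T[s][a][s2] * pi[s][a]."""
    n = len(T)
    return [[sum(T[s][a][s2] * pi[s][a] for a in range(len(pi[s])))
             for s2 in range(n)] for s in range(n)]


def hadamard_expectation(X, Y, pi):
    """Equation (2): expectation under pi of the elementwise product of X and Y."""
    n = len(X)
    return [[sum(X[s][a][s2] * Y[s][a][s2] * pi[s][a] for a in range(len(pi[s])))
             for s2 in range(n)] for s in range(n)]


# Hypothetical toy problem: 2 states, 2 actions (0 = "stay", 1 = "switch").
P = [[[1.0, 0.0], [0.0, 1.0]],   # transitions from state 0, one row per action
     [[0.0, 1.0], [1.0, 0.0]]]   # transitions from state 1
R = [[[0.0, 0.0], [0.0, 1.0]],   # reward 1 for switching from state 0 to state 1
     [[0.0, 0.0], [0.0, 0.0]]]
pi_u = [[0.5, 0.5], [0.5, 0.5]]  # uniform ("diffusion") policy

P_pi = average_over_policy(P, pi_u)        # Equation (1a)
R_pi = average_over_policy(R, pi_u)        # Equation (1b)
PR_pi = hadamard_expectation(P, R, pi_u)   # (P o R)^pi from Equation (2)
```

Under the uniform policy every row of `P_pi` is a probability distribution, and swapping the two arguments of `hadamard_expectation` leaves the result unchanged, in line with the symmetry noted after Equation (2).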

2.1.2 Value Functions and the Potential Operator 

Given a policy $\pi$, we may define a value function $V^\pi : S \to \mathbb{R}$ assigning to each state $s$ the
expected sum of discounted rewards collected over an infinite horizon by running the policy
$\pi$ starting in $s$:

$$V^\pi(s) = \mathbb{E}\left[ R(s_0, a_1, s_1) + \sum_{t=1}^{\infty} \left\{ \prod_{\tau=0}^{t-1} \Gamma(s_\tau, a_{\tau+1}, s_{\tau+1}) \right\} R(s_t, a_{t+1}, s_{t+1}) \;\middle|\; s_0 = s \right] \tag{3}$$

where the sequence of random variables $(s_t)_{t \ge 0}$ is a Markov chain with transition probability
matrix $P^\pi$. The expectation is taken over all sequences of state-action pairs $\{(s_t, a_t)\}_{t \ge 1}$,
where $a_t$ is an $A$-valued random variable representing the action which brings the Markov
chain to state $s_t$ from $s_{t-1}$: if $s_{t-1}$ is observed, then $a_t \sim \pi(s_{t-1})$. Thus, the expectation
in (3) should be interpreted as $\mathbb{E}_{a_1 \sim \pi(s_0)} \mathbb{E}_{s_1 \sim P(s_0, a_1, \cdot)} \mathbb{E}_{a_2 \sim \pi(s_1)} \cdots$. The state- and
action-dependent discount factors accrue in a path-dependent fashion leading to the product in (3).
When the discount factors are state dependent, it is possible to define different optimization
criteria; the choice (3) is commonly selected because it defines a value function which may be
computed via dynamic programming. This choice is also natural in the context of financial
applications.$^2$ The optimal value function $V^*$ is defined as $V^*(s) = \sup_{\pi \in \Pi} V^\pi(s)$ for all
$s \in S$, where $\Pi$ is the set of all stationary stochastic policies, and the corresponding optimal
policy $\pi^*$ is any policy achieving the optimal value function. Under the assumptions we have
imposed here, a deterministic optimal policy exists whenever an optimal policy (possibly
stochastic) exists (Bertsekas, 2007, Sec. 1.1.4). We will make use of stochastic policies
primarily to regularize a class of MDP solution algorithms, rather than to achieve better
solutions.

1. We will allow the set of actions available in state $s$ to be limited to a nonempty state-dependent subset
$A(s) \subset A$ of feasible actions, but do not explicitly keep track of the sets $A(s)$ to avoid cluttering the
notation. As a matter of bookkeeping, we assume that these constraints are enforced as needed by setting
$P(s, a, s') = 0$ for all $s'$ if $a \notin A(s)$, and/or by assigning zero probability to invalid actions in the case
of stochastic policies (discussed below). If a stochastic policy has been restricted to the feasible actions,
then it will be assumed that it has also been suitably re-normalized.

The process of computing $V^\pi$ given $\pi$ is known as value determination. Following
the usual approach, we may solve for $V^\pi$ by conditioning on the first transition in (3)
and applying the Markov property. However, when $\pi$ is stochastic, the first transition
also involves a randomly selected action, and when the discount factors are state/action
dependent, the particular discounting seen in (3) must be adopted in order to obtain a linear
system. One may derive the following equation for $V^\pi$ (details given in the Appendix):

$$V^\pi(s) = \sum_{s', a} P(s, a, s') \pi(s, a) \left[ R(s, a, s') + \Gamma(s, a, s') V^\pi(s') \right], \quad s \in S. \tag{4}$$

In matrix-vector form this system may be written as

$$V^\pi = \left( I - (\Gamma \circ P)^\pi \right)^{-1} r,$$

where $r := (P \circ R)^\pi \mathbf{1}$. The matrix $\left( I - (\Gamma \circ P)^\pi \right)^{-1}$ will be referred to as the potential
operator, or fundamental matrix, or Green's function, in analogy with the naming of the
matrix $(I - P^\pi)^{-1}$ for the Markov chain $P^\pi$.
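
Value determination can be sketched as follows. Rather than forming the potential operator explicitly, this illustration simply iterates Equation (4) to its fixed point, which is a swapped-in numerical technique that works here because the discount factors lie in $(0, 1)$. The function name, the tensor layout `X[s][a][s2]` and the two-state example are hypothetical.

```python
def evaluate_policy(P, R, G, pi, tol=1e-12, max_iter=100000):
    """Iterate Equation (4): V(s) = sum_{s2,a} P*pi*(R + G*V(s2)) until the update is below tol."""
    n = len(P)
    V = [0.0] * n
    for _ in range(max_iter):
        V_new = [sum(pi[s][a] * P[s][a][s2] * (R[s][a][s2] + G[s][a][s2] * V[s2])
                     for a in range(len(pi[s])) for s2 in range(n))
                 for s in range(n)]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical example: two states, a single action; both states move to state 1.
P = [[[0.0, 1.0]], [[0.0, 1.0]]]
R = [[[0.0, 1.0]], [[0.0, 2.0]]]       # reward 1 for 0 -> 1, reward 2 for 1 -> 1
G = [[[0.5, 0.5]], [[0.5, 0.5]]]       # constant discount factor 0.5
pi = [[1.0], [1.0]]

V = evaluate_policy(P, R, G, pi)
```

By hand, $V^\pi(1) = 2 + 0.5\,V^\pi(1)$ gives $V^\pi(1) = 4$, and $V^\pi(0) = 1 + 0.5 \cdot 4 = 3$, which is what the iteration converges to.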

2.2 Notation 

We denote by $(X_t)_{t \ge 0}$ a Markov chain, not necessarily time-homogeneous, governed by an
appropriate transition matrix $P$. For $S' \subset S$ we define the restriction of $P : S \times A \times S \to \mathbb{R}_+$
to $S'$ to be the transition tensor $P_{S'} : S' \times A \times S' \to \mathbb{R}_+$ defined by

$$P_{S'}(s, a, s') = \begin{cases} P(s, a, s') & \text{if } s, s' \in S',\ s \ne s' \\ P(s, a, s) + \sum_{s'' \notin S'} P(s, a, s'') & \text{if } s = s',\ s \in S'. \end{cases} \tag{5}$$

2. Consider the present value of an infinite stream of future cash payments $q_0, q_1, \ldots$, paid out at discrete
time instances $t = 0, 1, \ldots$. If the risk-free interest rate over the period $[t, t+1]$ is given by $r_t$, then the
present value of the payments is given by $C = q_0 + \sum_{t=1}^{\infty} \prod_{\tau=0}^{t-1} \gamma_\tau q_t$, where $\gamma_\tau = (1 + r_\tau)^{-1}$.

The rewards from $R : S \times A \times S \to \mathbb{R}$ associated to transitions between states in the subset
$S'$ remain unchanged:

$$R_{S'}(s, a, s') = R(s, a, s'), \quad \text{for all } (s, a, s') \text{ such that } s, s' \in S',\ a \in A.$$

We will refer to this operation as truncation, to distinguish it from restriction as defined
by (5). The sub-tensor $\Gamma_{S'}$ is similarly defined from $\Gamma$. Note that, by definition, $P_{S'}, R_{S'}, \Gamma_{S'}$
do not include tuples which start from a state $s$ in the cluster but which end at a state
outside of the cluster.

The restriction operation introduced above does not commute with taking expectations
with respect to a policy. The matrix $P^\pi_{S'}$ will be defined by first restricting to $S'$ by
Equation (5), and then averaging $P_{S'}$ with respect to $\pi$ as in Equation (1). Truncation does
commute with expectation over actions, so $R^\pi_{S'}, \Gamma^\pi_{S'}$ may be computed by truncating or
averaging in any order, although it is clearly more efficient to truncate before averaging.
In fact, to define these and other related quantities, $\pi$ need only be defined locally on
the cluster of interest. For quantities such as $(P_{S'} \circ R_{S'})^\pi$, we will always assume that
truncation/restriction occurs before expectation.

Lastly, $\mathrm{diag}(v)$ will denote a diagonal matrix with the elements of vector $v$ along the
main diagonal, and $a \wedge b$ will denote the minimum of the scalars $a$, $b$.
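
To make the distinction between restriction and truncation concrete, the following sketch applies both to a small tensor. The helper names `restrict` and `truncate`, the list-of-lists layout `X[s][a][s2]` and the three-state example are illustrative assumptions only.

```python
def restrict(P, subset):
    """Restriction: keep transitions inside `subset` and fold the probability
    of leaving `subset` into a self-loop, so rows remain stochastic."""
    idx = sorted(subset)
    inside = set(idx)
    n = len(P)
    out = []
    for s in idx:
        per_action = []
        for a in range(len(P[s])):
            leak = sum(P[s][a][t] for t in range(n) if t not in inside)
            per_action.append([P[s][a][t] + (leak if t == s else 0.0) for t in idx])
        out.append(per_action)
    return out


def truncate(X, subset):
    """Truncation: keep the entries indexed by `subset` unchanged, discard the rest."""
    idx = sorted(subset)
    return [[[X[s][a][t] for t in idx] for a in range(len(X[s]))] for s in idx]


# Hypothetical example: 3 states, 1 action; the cluster of interest is {0, 1}.
P = [[[0.2, 0.3, 0.5]],
     [[0.1, 0.6, 0.3]],
     [[0.0, 0.0, 1.0]]]
R = [[[1.0, 2.0, 3.0]],
     [[4.0, 5.0, 6.0]],
     [[7.0, 8.0, 9.0]]]

P_sub = restrict(P, {0, 1})   # rows become [0.7, 0.3] and [0.1, 0.9], up to rounding
R_sub = truncate(R, {0, 1})   # rewards within the cluster are left untouched
```

Restriction keeps each row a probability distribution, whereas truncation of $P$ alone would not; this is why transition tensors are restricted while rewards and discount factors are truncated.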

3. Multiscale Markov Decision Processes 

The high-level procedure for efficiently solving a problem with a multiscale MDP hierarchy,
which we will refer to as an "MMDP", consists of the following steps, to be described
individually in more detail below:

Step 1 Partition the statespace into subsets of states ("clusters") connected via "bottleneck" 
states. 

Step 2 Given the decomposition into clusters by bottlenecks, compress or homogenize the 
MDP into another, smaller and coarser MDP, whose statespace is the set of bottle- 
necks, and whose actions are given by following certain policies in clusters connecting 
bottlenecks ("subtasks"). 

Repeat the steps above with the compressed MDP as input, for a desired number of 
compression steps, obtaining a hierarchy of MDPs. 

Step 3 Solve the hierarchy of MDPs from the top-down (coarse to fine) by pushing 
solutions of coarse MDPs to finer MDPs, down to the finest scale. 

We say that the procedure above compresses or homogenizes, in a multiscale fashion, a
given MDP. The construction is perfectly recursive, in the sense that the same steps and
algorithms are used to homogenize one scale to the next coarser scale, and similarly for the
refinement steps of a coarse policy into a finer policy. We may, and will, therefore focus
on a single compression step and on a single refinement step. The compression procedure
also enjoys a form of consistency across scales: for example, optimal value functions at
homogenized scales are good approximations to homogenized optimal value functions at
finer scales. Moreover, actions at coarser scales are typically, as one may expect, complex,
"higher-level" actions, and the above procedure may be thought of as producing different
levels of "abstraction" of the original problem. While automating the process of
hierarchically decomposing, in a novel fashion, large complex MDPs, the framework we propose
may also yield significant computational savings: if at a scale $j$ there are $r_j$ clusters of
roughly equal size, and $n_j$ states, the solution to the MDP at that scale may be computed
in time $O(r_j (n_j / r_j)^3)$. If $r_j = n_j / C$ and $n_j = n / C^j$ (with $n$ being the size of the
original statespace), then the computation time across $\log n$ scales is $O(n \log n)$. We discuss
computational complexity in Section 3.4, and establish convergence of a particular solution
algorithm to the global optimum in Section 3.3.4. Finally, the framework facilitates
knowledge transfer between related MDPs. Sub-tasks and coarse solutions may be transferred
anywhere within the hierarchies for a pair of problems, instead of mapping entire problems.
We discuss transfer in Section 4.

The rest of this section is devoted to providing details and analysis for Steps (1) — (3) 
above in three subsections. Each of these subsections contains an overview of the construc- 
tion needed in each step followed by a more detailed and algorithmic discussion concerning 
specific algorithms used to implement the construction; the latter may be skipped in a 
first reading, in order to initially focus on "the big picture" . Proofs of the results in these 
subsections are all postponed until the Appendix. 

3.1 Step 1: Bottleneck Detection and Statespace Partitioning

The first step of the algorithm involves partitioning the MDP's statespace $S$ by identifying
a set $B \subset S$ of bottlenecks. The bottlenecks induce a partitioning$^3$ of $S \setminus B$ into a family $\mathcal{C}$ of
connected components. Typically $B$ depends on a policy $\pi$, and when we want to emphasize
this dependency, we will write $B^\pi$. We always assume that $B^\pi$ includes all terminal states.
The partitioning of $S \setminus B^\pi$ induced by the bottlenecks is the set of equivalence classes
$(S \setminus B^\pi)/\!\sim$, under the relation

$s_i \sim s_j$, if $s_i, s_j \notin B^\pi$ and there is a path from $s_i$ to $s_j$ not passing through any $b \in B^\pi$.

Clearly these equivalence classes yield a partitioning of $S \setminus B^\pi$. The term cluster will
refer to an equivalence class plus any bottleneck states connected to states in the class: if
$[s] := \{ s' \mid s \sim s' \}$ is an equivalence class,

$$c([s]) := [s] \cup \{ b \in B^\pi \mid P^\pi(s', b) > 0 \text{ or } P^\pi(b, s') > 0 \text{ for some } s' \in [s] \}.$$

The set of clusters is denoted by $\mathcal{C}$. If $c = c([s])$, $[s]$ will be referred to as the cluster's
interior, denoted by $\mathring{c}$, and the bottlenecks attached to $[s]$ will be referred to as the cluster's
boundary, denoted by $\partial c$.
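
The construction of clusters from a given bottleneck set can be sketched with a breadth-first search. This is only an illustration under a simplifying assumption: two states are treated as connected whenever $P^\pi$ is positive in either direction, so the path relation is symmetric by construction. The function name and the five-state chain are hypothetical.

```python
from collections import deque


def clusters_from_bottlenecks(P_pi, bottlenecks):
    """Return a list of (interior, boundary) pairs of state sets.

    Interiors are connected components of the non-bottleneck states; the
    boundary of a cluster is every bottleneck adjacent to its interior.
    Assumption: edges are treated as undirected.
    """
    n = len(P_pi)
    B = set(bottlenecks)

    def adjacent(i, j):
        return P_pi[i][j] > 0 or P_pi[j][i] > 0

    seen = set()
    clusters = []
    for s in range(n):
        if s in B or s in seen:
            continue
        interior = {s}
        seen.add(s)
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if v not in B and v not in seen and adjacent(u, v):
                    seen.add(v)
                    interior.add(v)
                    queue.append(v)
        boundary = {b for b in B if any(adjacent(b, v) for v in interior)}
        clusters.append((interior, boundary))
    return clusters


# Hypothetical example: random walk on the chain 0-1-2-3-4, with state 2 as bottleneck.
P_pi = [[0.0, 1.0, 0.0, 0.0, 0.0],
        [0.5, 0.0, 0.5, 0.0, 0.0],
        [0.0, 0.5, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.5, 0.0, 0.5],
        [0.0, 0.0, 0.0, 1.0, 0.0]]

clusters = clusters_from_bottlenecks(P_pi, [2])
```

Here the single bottleneck splits the chain into two clusters with interiors $\{0, 1\}$ and $\{3, 4\}$, both having boundary $\{2\}$.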

To each cluster $c$, and policy $\pi$ (defined on at least $c$), we associate the Markov process
with transition matrix $P^\pi_c$, defined according to Section 2.2.

We also assume that a set of designated policies $\boldsymbol{\pi}_c$ is provided for each cluster $c$. For
example $\boldsymbol{\pi}_c$ may be the singleton consisting of the diffusion policy in $c$. Or $\boldsymbol{\pi}_c$ could be
the set of locally optimal policies in $c$ for the family of MDPs, parametrized by $s' \in \partial c$,
with reward equal to the original rewards plus an additional reward when $s'$ is reached (this
approach is detailed in Section 3.3.1).

Finally, we say that $\partial c$ is $\pi$-reachable, for a policy $\pi$, if the set $\partial c$ can be reached in a
finite number of steps of $P^\pi_c$, starting from any initial state $s \in \mathring{c}$.

3. A partitioning $\mathcal{C} = \{c_i\}_i$ is a family of disjoint sets $c_i$ such that $S = \cup_i c_i$.

3.1.1 Algorithms for Bottleneck Detection 



In the discussion below we will make use of diffusion map embeddings (Coifman et al.,
2005) as a means to cluster, visualize and compare directed, weighted statespace graphs.
This is by no means the only possibility for accomplishing such tasks, and we will point out
other references later. Here we focus on this choice and the details of diffusion maps and
associated hierarchical clustering algorithms.

Diffusion maps are based on a Markov process, typically the random walk on a graph. The
random walks we consider here are of the form $P^\pi$ for some policy $\pi$, and may always be
made irreducible (by addition of the teleport matrix adding weak edges connecting every pair
of vertices, as in Step (2) of Algorithm 1), but may still be strongly directed, particularly
as $\pi$ becomes more directed. In light of this directedness, we will compute diffusion map
embeddings of the underlying states from normalized graph Laplacians symmetrized with
respect to the underlying Markov chain's invariant distribution, following (Chung, 2005):

$$\mathcal{L} = I - \frac{1}{2} \left( \Phi^{1/2} P^\pi \Phi^{-1/2} + \Phi^{-1/2} (P^\pi)^\top \Phi^{1/2} \right) \tag{6}$$

where $\Phi$ is a diagonal matrix with the invariant distribution $\mu$ satisfying $(P^\pi)^\top \mu = \mu$ placed
along the main diagonal, $\Phi_{ii} = \mu_i$. One can choose an orthonormal set of eigenvectors
$\{\Psi^{(k)}\}_{k \ge 0}$ with corresponding real eigenvalues $\lambda_k$ which diagonalize $\mathcal{L}$. If we place the
eigenvalues in ascending order $0 = \lambda_0 \le \lambda_1 \le \cdots$, the diffusion map embedding of the state
$s_i$ is given by

$$s_i \mapsto \left( \Psi^{(k)}_i (1 - \lambda_k) \right)_{k \ge 1}, \quad s_i \in S. \tag{7}$$

The diffusion distance between two states $s_i$, $s_j$ is given by the Euclidean distance between
embeddings,

$$d^2(s_i, s_j) = \sum_{k \ge 1} (1 - \lambda_k)^2 \left| \Psi^{(k)}_i - \Psi^{(k)}_j \right|^2, \quad s_i, s_j \in S.$$

See (Coifman et al., 2005) for a detailed discussion. Oftentimes this distance may be well
approximated (depending on the decay of the spectrum of $\mathcal{L}$) by truncating the sum after
$p < n$ terms, in which case only the top $p$ eigenvectors need to be computed.
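
Given eigenpairs of the symmetrized Laplacian, the embedding (7) and the diffusion distance are straightforward to evaluate. The sketch below assumes the eigenpairs have already been computed by some eigensolver and are passed in sorted by ascending eigenvalue; the function names and the layout `eigvecs[k][i]` (entry $i$ of eigenvector $k$) are illustrative choices.

```python
import math


def diffusion_embedding(eigvals, eigvecs, p=None):
    """Embed state i as (eigvecs[k][i] * (1 - eigvals[k])) for k = 1..p,
    skipping the trivial pair k = 0. If p is None, all non-trivial pairs are used."""
    last = len(eigvals) if p is None else p + 1
    n = len(eigvecs[0])
    return [[eigvecs[k][i] * (1.0 - eigvals[k]) for k in range(1, last)]
            for i in range(n)]


def diffusion_distance(embedding, i, j):
    """Euclidean distance between the embeddings of states i and j."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(embedding[i], embedding[j])))


# Hypothetical two-state chain P = [[0.9, 0.1], [0.1, 0.9]]. It is symmetric with a
# uniform invariant distribution, so the symmetrized Laplacian is I - P, with
# eigenvalues 0 and 0.2 and eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
r = 1.0 / math.sqrt(2.0)
eigvals = [0.0, 0.2]
eigvecs = [[r, r], [r, -r]]

emb = diffusion_embedding(eigvals, eigvecs)
d01 = diffusion_distance(emb, 0, 1)   # equals 0.8 * sqrt(2)
```

Passing a small `p` implements the truncation of the sum mentioned above, at the cost of ignoring the contribution of the remaining eigenpairs.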

In some cases we will need to align the signs of the eigenvectors for two given Laplacians
$\mathcal{L}, \tilde{\mathcal{L}}$ towards making diffusion map embeddings for different graphs more comparable. If
$\{\Psi^{(k)}\}_{k \ge 0}$ and $\{\tilde{\Psi}^{(k)}\}_{k \ge 0}$ denote the respective sets of eigenvectors, and the eigenvalues of
both Laplacians are distinct,$^4$ we can define the sign alignment vector $\tau$ as

$$\tau_k = \begin{cases} +1 & \text{if } \operatorname{sgn} \Psi^{(k)} = \operatorname{sgn} \tilde{\Psi}^{(k)} \\ -1 & \text{otherwise.} \end{cases} \tag{8}$$

Given an alignment vector $\tau$, one can extend the above diffusion distance to a distance
defined on a union of statespaces. If $S, \tilde{S}$ are statespaces with embeddings (7) respectively
defined by $\{(\Psi^{(k)}, \lambda_k)\}_{k=0}^{p}$, $\{(\tilde{\Psi}^{(k)}, \tilde{\lambda}_k)\}_{k=0}^{p}$ for some $p \ge 1$, then we can define the distance
$\tilde{d} : (S \cup \tilde{S}) \times (S \cup \tilde{S}) \to \mathbb{R}_+$ as

$$\tilde{d}^2(s_i, s_j) := \begin{cases} \rho(s_i, s_j) & \text{if } s_i \in S,\ s_j \in \tilde{S} \\ \rho(s_j, s_i) & \text{otherwise} \end{cases}$$

using

$$\rho(s_i, s_j) = \sum_{k} (1 - \lambda_k)(1 - \tilde{\lambda}_k) \left| \tau_k \Psi^{(k)}_i - \tilde{\Psi}^{(k)}_j \right|^2$$

with $\tau$ defined by (8).

4. The case of repeated eigenvalues may be treated similarly by generalizing the sign flipping operation
$\tau$ to an orthogonal transformation of the subspace spanned by the eigenvectors sharing the repeated
eigenvalue.

Hierarchical clustering. Given a policy $\pi$, we can construct a weighted statespace graph $G$
with vertices corresponding to states, and edge weights given by $P^\pi$. A policy that allows
thorough exploration, such as the diffusion policy $\pi^u$, can be chosen to define the weighted
statespace graph.

The hierarchical spectral clustering algorithm we will consider recursively splits the
statespace graph into pieces by looking for low-conductance cuts. The spectrum of the
symmetrized Laplacian for directed graphs (Chung, 2005) is used to determine the graph
cuts at each step. The sequence of cuts establishes a partitioning of the statespace, and
bottleneck states are states with edges that are severed by any of the cuts. Algorithm 1
describes the process. Other more sophisticated algorithms may also be used, for example
the algorithms of Spielman (Spielman and Teng) and Peres (Peres, 2009; Morris and Peres, 2003),
which only have access to a "black-box" computing the results of running
a process (truncated random walk, evolving sets process, respectively, for the references
above), but we do not pursue this here. A recursive application of Algorithm 1 produces a
set of bottlenecks $B^\pi$. Each bottleneck and partition discovered by the clustering algorithm
is associated with a spatial scale determined by the recursion depth. The finest scale
consists of the finest partition and includes all bottlenecks. The next coarser scale includes all
the bottlenecks and partitions discovered up to but not including the deepest level of the
recursion, and so on. In this manner the statespaces and actions of all the MDPs in a
multiscale hierarchy can be pre-determined, although if desired one can also apply clustering to
the coarsened statespaces after compressing, using the compressed MDP's transition matrix
as graph weights. The addition of a teleport matrix in Algorithm 1 (Step 2) guarantees
that the equivalence classes partition $S \setminus B^\pi$ and are strongly connected components of
the weighted graph defined by $P^\pi_{\mathrm{tel}}$.

Because graph weights are determined by $P^\pi$ in this algorithm, which bottlenecks will
be identified generally depends on the policy $\pi$. In this sense there are two types of
"bottlenecks": problem bottlenecks and geometric bottlenecks. Geometric bottlenecks may be
defined as interesting regions of the statespace alone, as determined by a random walk
exploration if $\pi$ is a diffusion policy (e.g. $\pi^u$). Problem bottlenecks are regions of the
statespace which are interesting from a geometric standpoint and in light of the goal
structure of the MDP. If the policy is already strongly directed according to the goals defined by
the rewards, then the bottlenecks can be interpreted as choke points for a random walker
in the presence of a strong potential.

Algorithm 1 Recursive spectral partitioning.

1. Restrict $P^\pi$ to non-absorbing states.

2. Set $P^\pi_{\mathrm{tel}} = (1 - \eta) P^\pi + \eta n^{-1} \mathbf{1} \mathbf{1}^\top$ for some small, positive $\eta$.

3. Find the eigenvector (invariant distribution) $\mu$ satisfying $(P^\pi_{\mathrm{tel}})^\top \mu = \mu$.

4. Let $\Phi = \mathrm{diag}(\mu)$ and compute the symmetrized Laplacian $\mathcal{L}$.

5. Compute the $K$ eigenvectors $\{\Psi^{(k)}\}$ of $\mathcal{L}$ corresponding to the $K$ smallest non-trivial
eigenvalues $\lambda_1 \le \cdots \le \lambda_K$.

6. Define a set of cuts $\mathcal{Z}$ by sweeping over thresholds ranging from the smallest entry
of $\Psi^{(k)}$ to the largest, for all eigenvectors $k = 1, \ldots, K$. The points for which $\Psi^{(k)}$
is above/below the given threshold define the states $Z, Z^c \subset S$ on either side
of the cut.

7. Choose the cut $Z^* = \arg\min_{Z \in \mathcal{Z}} \phi(Z)$ with minimum conductance

$$\phi(Z) = \frac{\sum_{i \in Z} \sum_{j \in Z^c} P_{ij}}{\mathrm{vol}(Z) \wedge \mathrm{vol}(Z^c)},$$

where $\mathrm{vol}(Z) = \sum_{i \in Z} \sum_{j \in S} P_{ij}$.

8. Identify bottleneck states as the states on one (and only one) side of the edges
severed by the cut, choosing the side which gives the smallest bottleneck set.

9. Store the partition of the statespace given by the cut.

10. Unless a stopping criterion is met, run the algorithm again on each of the two subgraphs
resulting from the cut.
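
Parts of Algorithm 1 can be sketched in a few lines: the teleport regularization of Step 2, a power iteration for the invariant distribution of Step 3, and the threshold sweep of Steps 6 and 7. Several things here are assumptions made for illustration: the function names, the use of power iteration, the reading of conductance as cut weight divided by the smaller of the two volumes (with the volume of a set taken as the total edge weight leaving its states), and the score vector `scores`, which merely stands in for an eigenvector of the symmetrized Laplacian that would be obtained from an eigensolver in practice.

```python
def teleport(P, eta=0.01):
    """Step 2: mix P with the uniform teleport matrix."""
    n = len(P)
    return [[(1.0 - eta) * P[i][j] + eta / n for j in range(n)] for i in range(n)]


def invariant_distribution(P, max_iter=10000, tol=1e-13):
    """Step 3, via power iteration on the row-stochastic matrix P (an assumed solver)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(mu, new)) < tol:
            return new
        mu = new
    return mu


def conductance(W, Z):
    """Cut weight from Z to its complement over the smaller of the two volumes."""
    n = len(W)
    Z = set(Z)
    Zc = set(range(n)) - Z
    cut = sum(W[i][j] for i in Z for j in Zc)
    vol = lambda A: sum(W[i][j] for i in A for j in range(n))
    return cut / min(vol(Z), vol(Zc))


def sweep_cut(W, scores):
    """Steps 6-7 for a single score vector: try every threshold, keep the best cut."""
    best = None
    for thr in sorted(set(scores))[:-1]:   # the largest threshold would leave Z^c empty
        Z = {i for i, x in enumerate(scores) if x <= thr}
        phi = conductance(W, Z)
        if best is None or phi < best[0]:
            best = (phi, Z)
    return best


# Hypothetical example: two tightly coupled pairs {0, 1} and {2, 3} joined by a weak link.
P = [[0.50, 0.50, 0.00, 0.00],
     [0.45, 0.45, 0.10, 0.00],
     [0.00, 0.10, 0.45, 0.45],
     [0.00, 0.00, 0.50, 0.50]]
P_tel = teleport(P)
mu = invariant_distribution(P_tel)
scores = [-1.0, -0.8, 0.8, 1.0]            # stand-in for a Laplacian eigenvector
phi_star, Z_star = sweep_cut(P_tel, scores)
```

On this example the sweep separates the two pairs, cutting the weak link between states 1 and 2. Steps 4 and 5 (forming the Laplacian and computing its eigenvectors) and the recursion of Step 10 are omitted from the sketch.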



3.2 Step 2: Multiscale Compression and the Structure of Multiscale Markov 
Decision Problems 

Given a set of bottlenecks $B$ and a suitable fine scale policy, we can compress (or homogenize,
or coarsen) an MDP into another MDP with statespace $B$. The coarse MDP can be thought
of as a low-resolution version of the original problem, where transitions between clusters are
the events of interest, rather than what occurs within each cluster. As such, coarse MDPs
may be vastly simpler: the size of the coarse statespace is on the order of the number of
clusters, which may be small relative to the size of the original statespace. Indeed, clusters
may be generally thought of as geometric properties of a problem, and are constrained by
the inherent complexity of the problem, rather than the choice of statespace representation,
discretization or sampling.

A solution to the coarse MDP may be viewed as a coarse solution to the original fine scale
problem. An optimal coarse policy describes how to solve the original problem by specifying
which sub-tasks to carry out and in which order. As we will describe in Section 3.3, a coarse
value function provides an efficient means to compute a fine scale value function and its
associated policy. Coarse MDPs and their solutions also provide a framework for systematic
transfer learning; these ideas are discussed in detail in Section 4.

We have discussed how to identify a set of bottleneck states in Section 3.1.1 above. As we
will explain in detail below, a policy is required to compress an MDP. This policy may encode
a priori knowledge, or may be simply chosen to be the diffusion policy. In Section 3.3.1
below, we suggest an algorithm for determining good local policies for compression that can
be expected to produce an MDP at the coarse scale whose optimal solution is compatible
with the gradient of the optimal value function at the fine scale.

A homogenized, coarse scale MDP will be denoted by the tuple $(\tilde{S}, \tilde{A}, \tilde{P}, \tilde{R}, \tilde{\Gamma})$. We first
give a brief description of the primary ingredients needed to define a coarse MDP, with a
more detailed discussion to follow in the forthcoming subsections.

• Statespace $\tilde{S}$: The coarse scale statespace $\tilde{S}$ is the set of bottleneck states $B$ for the
fine scale, obtained by clustering the fine scale statespace graph, for example with the
methods described in Section 3.1. Note that $\tilde{S} \subset S$.

• Action set $\tilde{A}$: A coarse action invoked from $b \in \tilde{S} = B$ consists of executing a given
fine scale policy $\pi_c \in \boldsymbol{\pi}_c$ within the fine scale cluster $c$, starting from $b \in \partial c$ (at a
time that we may reset to 0), until the first positive time at which a bottleneck state
in $\partial c$ is hit. Recall that in each cluster $c$ we have a set of policies $\boldsymbol{\pi}_c$.

• Coarse scale transition probabilities $\tilde{P}(s, a, s')$: If $a \in \tilde{A}$ is an action executing
the policy $\pi_c \in \boldsymbol{\pi}_c$, then $\tilde{P}(s, a, s')$ is defined as the probability that the Markov chain
$P^{\pi_c}_c$ started from $s \in \tilde{S}$ hits $s' \in \tilde{S}$ before hitting any other bottleneck. In particular,
$\tilde{P}(s, a, s')$ may be nonzero only when $s, s' \in \partial c$ for some $c \in \mathcal{C}$.

• Coarse scale rewards $\tilde{R}(s, a, s')$: The coarse reward $\tilde{R}(s, a, s')$ is defined to be the
expected total discounted reward collected along trajectories of the Markov chain
associated to action $a$ described above, which start at $s \in \tilde{S}$ and end by hitting $s' \in \tilde{S}$
before hitting any other bottleneck.

• Coarse scale discount factors $\tilde{\Gamma}(s, a, s')$: The coarse discount factor $\tilde{\Gamma}(s, a, s')$ is
the expected product of the discounts applied to rewards along trajectories of the
Markov chain $P^{\pi_c}_c$ associated to an action $a \in \tilde{A}$, starting at $s \in \tilde{S}$ and ending at
$s' \in \tilde{S}$.

One of the important consequences of these definitions is that the optimal fine scale 
value function on the bottlenecks is a good solution to the coarse MDP, compressed with 
respect to the optimal fine scale policy, and vice versa. It is this type of consistency across 
scales that will allow us to take advantage of the construction above and design efficient 
multiscale algorithms. 

The compression process is reminiscent of other instances of conceptually related proce- 
dures: coarsening (applied to meshes and PDEs, for example), homogenization (of PDEs), 
and lumping (of Markov chains). The general philosophy is to reduce a large problem to a 
smaller one constructed by "locally averaging" parts of the original problem. 

The coarsening step may always be accomplished computationally by way of Monte
Carlo simulations, as it involves computing the relevant statistics of certain functionals of
Markov processes in each of the clusters. As such, the computation is embarrassingly
parallel (see footnote 5). While this gives flexibility to the framework above, it is interesting to note that many
of the relevant computations may in fact be carried out analytically, and that eventually
they reduce to the solution of multiple independent (and therefore trivially parallelizable)
small linear systems, of size comparable to the size of a cluster. In this section we develop
this analytical framework in detail (with proofs collected in the Appendix), as it uncovers
the natural structure of the multiscale hierarchy we introduce, and leads to efficient,
"explicit" algorithms for the solution of the Markov decision problems we consider. The
rest of this section is somewhat technical, and on a first reading one may skip directly to
Section 3.3, where we discuss the multiscale solution of hierarchical MDPs obtained by our
construction.
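As an illustration of the Monte Carlo route, the following is a minimal Python sketch rather than the implementation used in this work. It assumes a single cluster whose fixed-policy chain is given as a row-stochastic matrix `P`, rewards `R[s][s']`, and a uniform fine scale discount `gamma`; the function names and the toy corridor are hypothetical. Each simulated run starts at a bottleneck and stops at the first positive time a bottleneck is hit, and the runs are averaged per destination.

```python
import random

def run_coarse_action(P, R, gamma, bottlenecks, start, rng):
    """Follow the fixed-policy chain from `start` until the first positive
    time a bottleneck is hit. Returns (end state, discounted reward,
    cumulative discount)."""
    state, total, discount = start, 0.0, 1.0
    while True:
        nxt = rng.choices(range(len(P)), weights=P[state])[0]
        total += discount * R[state][nxt]   # reward discounted by earlier steps
        discount *= gamma                   # uniform fine scale discount (assumption)
        state = nxt
        if state in bottlenecks:
            return state, total, discount

def estimate_coarse(P, R, gamma, bottlenecks, start, n_runs=20000, seed=0):
    """Monte Carlo estimates of the coarse (probability, reward, discount)
    for every bottleneck reached from `start`."""
    rng = random.Random(seed)
    sums = {b: [0, 0.0, 0.0] for b in bottlenecks}
    for _ in range(n_runs):
        end, total, discount = run_coarse_action(P, R, gamma, bottlenecks, start, rng)
        sums[end][0] += 1
        sums[end][1] += total
        sums[end][2] += discount
    return {b: (n / n_runs, tot / n, disc / n)
            for b, (n, tot, disc) in sums.items() if n > 0}

# Hypothetical toy cluster: a 5-state corridor whose end points are bottlenecks.
P = [[0, 1, 0, 0, 0],
     [0.5, 0, 0.5, 0, 0],
     [0, 0.5, 0, 0.5, 0],
     [0, 0, 0.5, 0, 0.5],
     [0, 0, 0, 1, 0]]
R = [[-1.0] * 5 for _ in range(5)]          # unit cost per step
coarse = estimate_coarse(P, R, gamma=0.9, bottlenecks={0, 4}, start=0)
```

For this corridor the exact coarse probability of reaching state 4 before returning to state 0 is 1/4 (a gambler's ruin computation), so the estimate `coarse[4][0]` should be close to 0.25.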

3.2.1 Assumptions 

We will always assume that the fine scale policy $\pi$ used to compress has been regularized
by blending with a small amount of the diffusion policy $\pi^u$:

$$\pi(s, \cdot) \leftarrow \lambda\, \pi^u(s, \cdot) + (1 - \lambda)\, \pi(s, \cdot), \qquad s \in S,$$

for some small, positive choice of the regularization parameter $\lambda$. In particular we will
assume this is the case wherever quantities depending on the compression policy appear below. The goal of this
regularization is to address, or partially address, the following issues:

• The solution process may be initially biased towards one particular (and possibly
incorrect) solution, but this bias can be overcome when solving the coarse MDP as
long as the regularization described above is included every time compression occurs
during the iterative solution process we will describe in Section 3.3.



• Directed policies can yield a fine scale transition matrix which, when restricted to a
cluster, may render bottleneck (or other) states unreachable. We require the boundary
of each cluster to be $\pi$-reachable, and this is often guaranteed by the regularization
above except in rather pathological situations. If any interior states violate this condition,
they can be added to the cluster's boundary and to the global bottleneck set
at the relevant scale (see footnote 6). We note that these assumptions are significantly weaker than
requiring that the subgraphs induced by the restrictions $P_c^{\pi}$, $\pi \in \boldsymbol{\pi}_c$, of $P$ to a cluster
are strongly connected components; the Markov chain defined by $P_c^{\pi}$ need not be
irreducible.

• Compression with respect to a deterministic and/or incorrect policy should not pre- 
clude transfer to other tasks. In the case of policy transfer, to be discussed below, 
errors in a policy used for compression can easily occur, and can lead to unreachable 
states. Policy regularization helps alleviate this problem. 
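The regularization step can be sketched in a few lines of Python. This is an illustrative sketch that assumes the diffusion policy is uniform over the actions available in each state and that a policy is stored as a list of per-state action distributions; the name `regularize` is hypothetical.

```python
def regularize(policy, lam):
    """Blend every row pi(s, .) with the uniform (diffusion) policy:
    lam * uniform + (1 - lam) * pi(s, .)."""
    return [[lam / len(row) + (1.0 - lam) * p for p in row] for row in policy]

pi = [[1.0, 0.0, 0.0],      # deterministic in state 0
      [0.0, 0.5, 0.5]]      # stochastic in state 1
pi_reg = regularize(pi, lam=0.05)
```

After blending, every action has strictly positive probability in every state, which is what makes previously unreachable bottlenecks reachable again in the situations listed above.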



5. Moreover, it does not require a priori knowledge of the fine details of the models in each cluster, but 
only requires the ability to call a "black box" which simulates the prescribed process in each cluster, and 
computes the corresponding functional (in this sense coarsening becomes model-free). 

6. In fact, if any such state is an element of a closed, communicating class, then the entire class can be 
lumped into a single state and treated as a single bottleneck. Thus, the bottleneck set does not need to 
grow with the size of the closed class from which the boundary is unreachable. For simplicity however, 
we will assume in the development below that no lumped states of this type exist.

3.2.2 Actions

An action $a$ at $s \in \tilde{S}$ for the compressed MDP consists of executing a policy $\pi_c \in \boldsymbol{\pi}_c$ at
the fine scale, starting in $s$, within some cluster c having $s$ on its boundary, until hitting a
bottleneck state in c. The number of actions is equal to the total number of policies across
clusters, $|\tilde{A}| = \sum_{c \in C} |\boldsymbol{\pi}_c|$. We now fix a cluster c and a policy $\pi_c \in \boldsymbol{\pi}_c$. The corresponding
local Markov transition matrix is $P_c^{\pi_c}$; let $R_c^{\pi_c}$ denote the reward structure, and $\Gamma_c^{\pi_c}$
denote the system of discount factors, following Section 2.2. Let $(X_n^{\pi_c})_{n \ge 0}$ denote the
Markov chain with transition matrix $P_c^{\pi_c}$. If the coarse action is invoked in state $s \in \tilde{S}$,
then we set $X_0 = s$. The set of actions available at $s \in \tilde{S}$ for the compressed MDP is given
by

$$\tilde{A}(s) := \bigcup_{c \,:\, s \in \partial c} \; \bigcup_{\pi_c \in \boldsymbol{\pi}_c} \big\{ \text{"run the MRP } (P_c^{\pi_c}, R_c^{\pi_c}, \Gamma_c^{\pi_c}) \text{ in } c \text{ until the first } n > 0 : X_n^{\pi_c} \in \mathcal{B}\text{"} \big\}.$$

A Markov reward process (MRP) refers to an MDP with a fixed policy and corresponding
$P$, $R$, $\Gamma$ restricted to that fixed policy. The actions above involve running an MRP because
while the action is being executed the policy remains fixed. 

Consider the simple example shown on the left in Figure 1, where we graphically depict
a simple coarse MDP (large circles and bold edges) superimposed upon a stylized fine
scale statespace graph (light gray edges; vertices are edge intersections). Undirected edges
between coarse states are bidirectional. Dark gray lines delineate four clusters, to which
we have associated fine scale policies $\pi_1, \dots, \pi_4$. The bottlenecks are the states labeled
$s_1, \dots, s_4$. If an agent is in state $s_1$, for example, the actions "Execute $\pi_1$" and "Execute
$\pi_4$" are feasible. If the agent takes the coarse action a = "Execute $\pi_1$", then it can either
reach $s_2$ or come back to $s_1$, since this action forbids traveling outside of cluster 1 (top right
quadrant). On the other hand, if $\pi_4$ is executed from $s_1$, then the agent can reach $s_4$ or
return to $s_1$, but the probability of transitioning to $s_2$ is zero.

In general, the compressed MDP will have action and state dependent rewards and
discount factors, even if the fine scale problem does not. In Figure 1 (left), the coarse
states straddle two clusters each, and therefore have different self loops corresponding to
paths which return to the starting state within one of the two clusters. So $\tilde{R}$ and $\tilde{\Gamma}$
apparently depend on actions. But we may reach $s_1$ when executing $\pi_1$ starting from either
$s_1$ or from $s_2$, so the compressed quantities in fact depend on both the actions and the
source/destination states. Figure 1 (right) shows another example, where the dependence
on source states is particularly clear. Even if the action corresponding to running a fine
policy in the center square is the same for all states, each coarse state $s_1, \dots, s_4$ may be
reached from two neighbors as well as itself.

3.2.3 Transition Probabilities 

Consider the cluster c referred to by a coarse action $a \in \tilde{A}$. The transition probability
$\tilde{P}(s, a, s')$ for $s, s' \in \partial c \subset \tilde{S}$ is defined as the probability that a trajectory in $c \subset S$ hits state
$s'$ starting from $s$ before hitting any other state in $\mathcal{B}$ (including itself) when running the
fine scale MRP restricted to c and along the policy determined by the action a.

If $s$ is a state not in the cluster associated to a, then a is not a feasible action when
in state $s$. For the example shown in Figure 1 (left), for instance, the edge weight connecting
$s_1$ and $s_2$ is the probability that a trajectory reaches $s_2$ before it can return to $s_1$,
when executing $\pi_1$. These probabilities may be estimated either by sampling (Monte Carlo
simulations), or computed analytically. The first approach is trivially implemented, with
the usual advantages (e.g. parallelism, access to a black-box simulator is all that is needed) and
disadvantages (slow convergence); here we develop the latter, which leads to a concise set of
linear problems to be solved, and sheds light on both the mathematical and computational
structure. Since the bottlenecks partition the statespace into disjoint sets, the probabilities
$\tilde{P}(s, a, s')$ can be quickly computed in each cluster separately.

Figure 1: (Left) Graphical depiction of a simple coarse MDP. The states $s_1, \dots, s_4$ enclosed in
circles are bottleneck states, and act as gateways between the four clusters. The bottleneck
states each connect bidirectionally to two other neighboring bottleneck states as well as
to themselves (thin black edges). Two self loops are shown per state to emphasize that
coarse self transitions can be different depending on the action taken. (Right) Another
example emphasizing that coarse probabilities, rewards and discounts are in general both
state and action dependent. See text for details.

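Proposition 1 Let a be the action corresponding to executing a policy $\pi_c$ in cluster c. Then

$$\tilde{P}(s, a, s') = H_{s,s'}, \qquad \text{for all } s, s' \in \partial c,$$

where H is the minimal non-negative solution, for each $s' \in \partial c$, to the linear system

$$H_{s,s'} = P_c^{\pi_c}(s, s') + \sum_{s'' \in c \setminus \partial c} P_c^{\pi_c}(s, s'')\, H_{s'',s'}, \qquad s \in c,\ s' \in \partial c, \qquad (9)$$

or in matrix-vector form,

$$\big(I - P_c^{\pi_c}(I - J)\big) H = P_c^{\pi_c},$$

where J is a matrix of all zeros except $J_{s'',s''} = 1$ for $s'' \in \partial c$.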

Corollary 2 Consider the partitioning

$$P_c^{\pi_c} = \begin{pmatrix} Q & B \\ C & D \end{pmatrix},$$

where the blocks Q, D describe the interaction among non-bottleneck and bottleneck states
within cluster c respectively. The compressed probabilities may be obtained by finding the
minimal non-negative solution to

$$(I - Q)\, h_q = B$$

followed by computing

$$h_b = D + C\, h_q,$$

where $h_b$ is the transition probability matrix of the compressed MDP given the action a.

The proof of this Proposition and a discussion are given in the Appendix. 
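The two-stage computation in Corollary 2 can be sketched as follows. This is an illustrative Python sketch, not the solver used in this work: instead of a direct linear solve it iterates $h_q \leftarrow B + Q h_q$ starting from zero, which increases monotonically to the minimal non-negative solution. The function name, the index lists `interior` and `boundary`, and the toy corridor are hypothetical.

```python
def compress_transitions(P, interior, boundary, n_iter=2000):
    """Return h_b, where h_b[i][j] is the probability that the chain started
    at boundary[i] first re-hits the bottleneck set at boundary[j]."""
    def backup(s, h_q):
        # P(s, b) + sum over interior states u of P(s, u) * h_q[u][b]
        return [P[s][b] + sum(P[s][u] * h_q[k][j] for k, u in enumerate(interior))
                for j, b in enumerate(boundary)]

    # Solve (I - Q) h_q = B by fixed-point iteration from zero.
    h_q = [[0.0] * len(boundary) for _ in interior]
    for _ in range(n_iter):
        h_q = [backup(s, h_q) for s in interior]
    # h_b = D + C h_q
    return [backup(s, h_q) for s in boundary]

# Hypothetical 5-state corridor; states 0 and 4 are the bottlenecks.
P = [[0, 1, 0, 0, 0],
     [0.5, 0, 0.5, 0, 0],
     [0, 0.5, 0, 0.5, 0],
     [0, 0, 0.5, 0, 0.5],
     [0, 0, 0, 1, 0]]
h_b = compress_transitions(P, interior=[1, 2, 3], boundary=[0, 4])
```

In this example the first row of `h_b` is (3/4, 1/4): starting from state 0 the chain returns to 0 before reaching 4 with probability 3/4. For larger clusters a direct sparse solve would normally replace the fixed-point loop.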

When deriving the compressed rewards and discount factors below, we will need
to reference the set of all pairs of bottlenecks $s, s'$ for which the probability of reaching $s'$
starting from $s$ is positive, when executing the policy $\pi_c$ associated to a. Having defined $\tilde{P}$,
this set may be easily characterized as

$$\mathrm{supp}_a(\tilde{P}) := \{(s, s') \in \partial c \times \partial c \mid \tilde{P}(s, a, s') > 0\},$$

where c is the cluster associated to the coarse action a.

3.2.4 Rewards 

The rewards $\tilde{R} = \tilde{R}(s, a, s')$, with $s, s' \in \partial c$ and $a \in \tilde{A}$, are defined to be the expected
discounted rewards collected along trajectories that start from $s$ and hit $s'$ before hitting any
other bottleneck state in $\partial c$, when running the fine-scale MRP restricted to the cluster c
associated to a.

In general, rewards under different policies and/or in other clusters are calculated by 
repeating the process described below for different choices of $\pi \in \boldsymbol{\pi}_c$, $c \in C$. As was the case
in the examples shown in Figure 1, even if the fine scale MDP rewards do not depend on
the source state or actions, the compressed MDP's rewards will, in general, depend on the 
source, destination and action taken. However as with the coarse transition probabilities, 
the relevant computations involve at most a single cluster's subgraph at a time. 

Given a policy $\pi_c$ on cluster c, consider the Markov chain $(X_t)_{t \ge 0}$ with transition matrix
$P_c^{\pi_c}$. Let $T$ and $T'$ be two arbitrary stopping times satisfying $0 \le T < T' < \infty$ (a.s.). The
discounted reward accumulated over the interval $T \le t \le T'$ is given by the random variable

$$R_T^{T'} := R(X_T, a_{T+1}, X_{T+1}) + \sum_{t=T+1}^{T'-1} \left[ \prod_{\tau=T}^{t-1} \Gamma(X_\tau, a_{\tau+1}, X_{\tau+1}) \right] R(X_t, a_{t+1}, X_{t+1}),$$

where $a_{t+1} \sim \pi_c(X_t)$ for $t = T, \dots, T'-1$, and we set $R_T^T = 0$ for any $T$.
Next, define the hitting times of $\partial c$:

$$T_m = \inf\{t > T_{m-1} \mid X_t \in \partial c\}, \qquad m = 1, 2, \dots$$

with $T_0 = \inf\{t \ge 0 \mid X_t \in \partial c\}$. Note that if the chain is started in a bottleneck state
$X_0 = b \in \partial c$, then clearly $T_0 = 0$. We will be concerned with the rewards accumulated
between these successive hitting times, and by the Markovianity of $(X_t)_t$, we may, without
loss of generality, consider the reward between $T_0$ and $T_1$, namely $R_{T_0}^{T_1}$.

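The following proposition describes how to compute the expected discounted rewards
by solving a collection of linear systems.

Proposition 3 Suppose the coarse scale action $a \in \tilde{A}$ corresponds to executing a policy
$\pi_c$ in cluster c, and let $(X_t)_{t \ge 0}$ denote the Markov chain with transition matrix $P_c^{\pi_c}$. The
rewards $\tilde{R}$ at the coarse scale may be characterized as

$$\tilde{R}(s, a, s') = E_s\big[R_{T_0}^{T_1} \mid X_{T_1} = s'\big], \qquad (s, s') \in \mathrm{supp}_a(\tilde{P}).$$

Moreover, for fixed a, $\tilde{R}(s, a, s') =: H_{s,s'}$ may be computed by finding the (unique, bounded)
solution H to the linear system

$$H_{s,s'} = \sum_{s'' \in \mathring{c} \cap c'_{s'},\ a \in A} P_{h_{s'}}(s, a, s'')\, \Gamma(s, a, s'')\, H_{s'',s'} + \sum_{s'' \in c'_{s'},\ a \in A} P_{h_{s'}}(s, a, s'')\, R(s, a, s''), \qquad \text{if } s \in \mathring{c} \cap c'_{s'}, \qquad (10a)$$

$$H_{s,s'} = \sum_{s'' \in \mathring{c} \cap c'_{s'},\ a \in A} \tilde{P}_{h_{s'}}(s, a, s'')\, \Gamma(s, a, s'')\, H_{s'',s'} + \sum_{s'' \in c'_{s'},\ a \in A} \tilde{P}_{h_{s'}}(s, a, s'')\, R(s, a, s''), \qquad \text{if } (s, s') \in \mathrm{supp}_a(\tilde{P}), \qquad (10b)$$

where $\mathring{c} := c \setminus \partial c$ is the interior of c, and $c'_{s'} := \{s \in c \mid h_{s'}(s) > 0\}$;

$$P_{h_{s'}}(s, a, s'') := \frac{P_c(s, a, s'')\, \pi_c(s, a)\, h_{s'}(s'')}{h_{s'}(s)}$$

for $s \in \mathring{c} \cap c'_{s'}$, $a \in A$, $s'' \in c'_{s'}$; and

$$\tilde{P}_{h_{s'}}(s, a, s'') := \frac{P_c(s, a, s'')\, \pi_c(s, a)\, h_{s'}(s'')}{\tilde{P}(s, a, s')}$$

for $(s, s') \in \mathrm{supp}_a(\tilde{P})$, $a \in A$, $s'' \in c'_{s'}$; with $h_{s'}(s) := P_s(X_{T_0} = s')$, for $s \in c$, $s' \in \partial c$,
denoting the minimal, non-negative, harmonic function satisfying

$$h_{s'}(s) = \begin{cases} \delta_{s,s'} & s \in \partial c \\ P_c^{\pi_c}(s, s') + \sum_{s'' \in \mathring{c}} P_c^{\pi_c}(s, s'')\, h_{s'}(s'') & s \in \mathring{c}. \end{cases}$$

Thus, for each $s' \in \partial c$, the compressed rewards $\tilde{R}(s, a, s')$ are computed by first solving
a linear system of size at most $|c| \times |c|$ given by (10a), and then computing at most $|\partial c|$
of the sums given by (10b) (the function $h_{s'}$ has already been computed during the course of
solving for the compressed transition probabilities).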

See the Appendix for a proof, a matrix formulation of this result, and a discussion 
concerning computational consequences. 
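To make the conditioning on the exit state concrete, the following Python sketch computes a coarse probability, reward and cumulative discount for one source/destination pair of bottlenecks by reweighting transitions with the harmonic function $h_{s'}$. It is an illustration under simplifying assumptions: the policy has already been folded into a single transition matrix `P`, rewards `R[s][u]` and discounts `G[s][u]` depend only on the source and destination states, and the linear systems are solved by fixed-point sweeps. All names are hypothetical.

```python
def hitting_function(P, interior, target, n_iter=5000):
    """h[s] = probability that the chain started at s first meets the
    bottleneck set at `target` (1 at target, 0 at every other bottleneck)."""
    h = [0.0] * len(P)
    h[target] = 1.0
    for _ in range(n_iter):
        for s in interior:
            h[s] = sum(p * h[u] for u, p in enumerate(P[s]))
    return h

def coarse_reward_discount(P, R, G, interior, start, target, n_iter=5000):
    """Return (coarse probability, coarse reward, coarse discount) for the
    pair (start, target) of bottlenecks."""
    n = len(P)
    h = hitting_function(P, interior, target, n_iter)
    p_coarse = sum(P[start][u] * h[u] for u in range(n))
    if p_coarse <= 0.0:
        raise ValueError("target is not reachable from start")
    live = {s for s in interior if h[s] > 0.0}
    HR = [0.0] * n                      # conditional expected discounted reward
    HD = [0.0] * n                      # conditional expected cumulative discount

    def backup(s, norm):
        rew = disc = 0.0
        for u in range(n):
            w = P[s][u] * h[u] / norm   # h-transformed transition weight
            if w == 0.0:
                continue
            if u in live:               # trajectory continues inside the cluster
                rew += w * (R[s][u] + G[s][u] * HR[u])
                disc += w * G[s][u] * HD[u]
            else:                       # u is the target bottleneck
                rew += w * R[s][u]
                disc += w * G[s][u]
        return rew, disc

    for _ in range(n_iter):
        for s in live:
            HR[s], HD[s] = backup(s, h[s])
    return (p_coarse,) + backup(start, p_coarse)

# Hypothetical 3-state cluster: bottlenecks 0 and 2, one interior state with a self loop.
P = [[0.0, 1.0, 0.0], [0.25, 0.5, 0.25], [0.0, 1.0, 0.0]]
cost = [[-1.0] * 3 for _ in range(3)]
disc9 = [[0.9] * 3 for _ in range(3)]
ones = [[1.0] * 3 for _ in range(3)]
p, r, d = coarse_reward_discount(P, cost, disc9, interior=[1], start=0, target=2)
_, length, _ = coarse_reward_discount(P, ones, ones, interior=[1], start=0, target=2)
```

In this toy cluster the conditional expected discount is 0.9 · 0.45/0.55 ≈ 0.7364 and the conditional expected reward is about −2.6364. Setting all rewards and discounts to 1 turns the same computation into an expected path length, which is 3 here.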

3.2.5 Trajectory Lengths 



Assume the hitting times $(T_m)_{m \ge 0}$ are as defined in Section 3.2.4. We note that the average
path lengths (hitting times) between pairs of bottlenecks,

$$L(s, a, s') := E_s[T_1 \mid X_{T_1} = s'], \qquad s, s' \in \partial c \text{ such that } (s, s') \in \mathrm{supp}_a(\tilde{P}),$$

can be computed using the machinery exactly as described in Section 3.2.4 above by setting

$$\Gamma(s, a, s') = 1, \quad R(s, a, s') = 1, \qquad \text{for all } s, s' \in S,\ a \in A$$

at the fine scale, and then applying the calculations for computing expected "rewards" given
by Equations (10a) and (10b) subject to a non-negativity constraint. Although expected 
path lengths are not essential for defining a compressed MDP, these quantities can still 
provide additional insight into a problem. For example, when simulations are involved, 
expected path lengths might hint at the amount of exploration necessary under a given 
policy to characterize a cluster. 



3.2.6 Discount Factors 

When solving an MDP using the hierarchical decomposition introduced above, it is im- 
portant to seek a good approximation for the discount factors at the coarse scale. In our 
experience, this results in improved consistency across scales, and improved accuracy of 
and convergence to the solution. In the preceding sections, a coarse MDP was computed 
by averaging over paths between bottlenecks at a finer scale. Depending on the particular 
source/destination pair of states, the paths will in general have different length distribu- 
tions. Thus, when solving a coarse MDP, rewards collected upon transitioning between 
states at the coarse scale should be discounted at different, state-dependent rates. The 
correct discount rate is a random variable (as is the path length), and transitions at the 
coarse scale implicitly depend on outcomes at the fine scale. We will partially correct for 
differing length distributions (and avoid the need to simulate at the fine scale) by imposing 
a coarse non-uniform discount factor based on the cumulative fine scale discount applied on 
average to paths between bottlenecks. The coarse discount factors $\tilde{\Gamma}$ are incorporated when
solving the coarse MDP so that the scale of the coarse value function is more compatible 
with the fine problem, and convergence towards the fine-scale policy may be accelerated. 

The expected cumulative discounts may be computed using a procedure similar to the
one given for computing expected rewards in Section 3.2.4. As before, given a policy $\pi_c$ on
cluster c, consider the Markov chain $(X_t)_{t \ge 0}$ with transition matrix $P_c^{\pi_c}$, and let $T, T'$ be
two arbitrary stopping times satisfying $0 \le T < T' < \infty$ (a.s.). The cumulative discount
applied to trajectories $(X_T, X_{T+1}, \dots, X_{T'})$ over the interval $T \le t \le T'$ is given by the
random variable

$$\Delta_T^{T'} := \prod_{t=T}^{T'-1} \Gamma(X_t, a_{t+1}, X_{t+1}),$$

where $a_{t+1} \sim \pi_c(X_t)$ for $t = T, \dots, T'-1$. The following proposition describes how to
compute the expected discount factors by solving a collection of linear problems with
non-negativity constraints.



Proposition 4 Suppose the coarse scale action $a \in \tilde{A}$ corresponds to executing the policy
$\pi_c$ in cluster c. Let $(X_t)_{t \ge 0}$ denote the Markov chain with transition matrix $P_c^{\pi_c}$, and let
$(T_m)_{m \ge 0}$ denote the boundary hitting times defined in Section 3.2.4. The discount factors
$\tilde{\Gamma}$ at the coarse scale may be characterized as

$$\tilde{\Gamma}(s, a, s') = E_s\big[\Delta_{T_0}^{T_1} \mid X_{T_1} = s'\big], \qquad (s, s') \in \mathrm{supp}_a(\tilde{P}),$$

and, letting $H_{s,s'} := \tilde{\Gamma}(s, a, s')$, may be computed by finding the minimal non-negative solution
H to the linear system

$$H_{s,s'} = \begin{cases} \displaystyle\sum_{s'' \in \mathring{c} \cap c'_{s'},\ a \in A} P_{h_{s'}}(s, a, s'')\, \Gamma(s, a, s'')\, H_{s'',s'} + \sum_{a \in A} P_{h_{s'}}(s, a, s')\, \Gamma(s, a, s'), & \text{if } s \in \mathring{c} \cap c'_{s'}, \\[2ex] \displaystyle\sum_{s'' \in \mathring{c} \cap c'_{s'},\ a \in A} \tilde{P}_{h_{s'}}(s, a, s'')\, \Gamma(s, a, s'')\, H_{s'',s'} + \sum_{a \in A} \tilde{P}_{h_{s'}}(s, a, s')\, \Gamma(s, a, s'), & \text{if } (s, s') \in \mathrm{supp}_a(\tilde{P}), \end{cases}$$

where $h_{s'}(s)$, $c'_{s'}$, $P_{h_{s'}}$, $\tilde{P}_{h_{s'}}$ are as defined in Proposition 3.

The proof is again postponed until the Appendix.

Figure 2: Different solution algorithms for solving a pair of coarse/fine MDPs are obtained by
iterating over different paths in this flow graph. See text for details.

It is worth observing that if the discount factor at the fine scale is uniform, $\Gamma(s, a, s') = \gamma$
with no dependence on states or actions, then the expected cumulative discounts may be
related to the average path lengths $L(s, a, s')$ between pairs of bottlenecks described in
Section 3.2.5. In particular, suppose $T(s, a, s')$ is the first passage time of a fine scale
trajectory starting at $s \in \partial c$ and ending at $s' \in \partial c$ following a policy determined by the
coarse action $a \in \tilde{A}$. Then $L(s, a, s') = E[T(s, a, s')]$, and Jensen's inequality implies that

$$E\big[\gamma^{T(s,a,s')}\big] \ge \gamma^{E[T(s,a,s')]}.$$

Thus $\tilde{\Gamma}(s, a, s') \ge \gamma^{E[T(s,a,s')]} = \gamma^{L(s,a,s')}$. This approximation improves as $\gamma \uparrow 1$. However,
this is only true for uniform $\gamma$ at the fine scale, and even for $\gamma$ close to 1, the relationship
above may be loose. Although the connection between path lengths and discount factors is
illuminating and potentially useful in the context of Monte-Carlo simulations, we suggest
calculating coarse discount factors according to Proposition 4 rather than through path
length averages.
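A small numerical check of the Jensen bound may be useful. The Python sketch below assumes a hypothetical first passage time of the form T = 1 + G with G geometric with success probability 1/2 (one step into the cluster, then a geometric number of steps until exit), and compares the expected discount with the discount raised to the expected path length. The function names are illustrative.

```python
def discount_and_length(gamma, p_exit=0.5, max_len=2000):
    """Expected cumulative discount E[gamma^T] and expected length E[T]
    for T = 1 + G, with G geometric(p_exit) on {1, 2, ...} (truncated sum)."""
    e_disc = e_len = 0.0
    for g in range(1, max_len):
        prob = (1.0 - p_exit) ** (g - 1) * p_exit
        T = 1 + g
        e_disc += prob * gamma ** T
        e_len += prob * T
    return e_disc, e_len

def jensen_gap(gamma):
    """E[gamma^T] - gamma^E[T], which Jensen's inequality makes non-negative."""
    e_disc, e_len = discount_and_length(gamma)
    return e_disc - gamma ** e_len

gap_090 = jensen_gap(0.90)
gap_099 = jensen_gap(0.99)
```

With these assumptions the expected length is 3, the gap at a discount of 0.9 is roughly 0.0074, and the gap shrinks by almost two orders of magnitude at 0.99, in line with the remark that the approximation improves as the discount approaches 1.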

In this and previous subsections, the approach taken is in the spirit of revealing the 
structure of the coarsening step and how it is possible to compute many coarser variables, 
or approximations thereof, by solutions of linear systems. Of course one may always use 
Monte-Carlo methods, which in addition to estimates of the expected values, may be used 
to obtain more refined approximations to the law of the random variables $\Delta_{T_0}^{T_1}$ and $R_{T_0}^{T_1}$.

3.3 Step 3: Multiscale Solution of MDPs 

Given a (fine) MDP and a coarsening as above, a solution to the fine scale MDP may be 
obtained by applying one of several possible algorithms suggested by the flow diagram in
Figure 2. Solving for the finer scale's policy involves alternating between two main computational
steps: (1) updating the fine solution given the coarse solution, and (2) updating the
coarse solution given the fine solution. Given a coarse solution defined on bottleneck states, 
the fine scale problem decomposes into a collection of smaller independent sub-problems, 
each of which may be solved approximately or exactly. These are iterations along the inner 
loop surrounding "update fine" in Figure 2. After the fine scale problem has been updated,
the solution on the bottlenecks may be updated either with or without a re-compression
step. The former is represented by the long upper feedback loop in Figure 2, while the
latter corresponds to the outer, lower loop passing through "update boundary". Updating 
without re-compressing may, for instance, take the form of the updates (Bellman, averag- 
ing) appearing in any of the asynchronous policy/value iteration algorithms. Updating by 
re-compression consists of re-compressing with respect to the current, updated fine policy 
and then solving the resulting coarse MDP. 

The discussion that follows considers an arbitrary pair of successive scales, a "fine scale" 
and a "coarse scale", and applies to any such pair within a general hierarchy of scales. A 
key property of the compression step is that it yields new MDPs in the same form, and 
can therefore be iterated. Similarly, the process of coarsening and refining policies and
value functions allows one to move from fine to coarse scales and then from coarse to fine,
respectively, and therefore may be repeated through several scales, in different possible 
patterns. 

In a problem with many scales, the hierarchy may be traversed in different ways by 
recursive applications of the solution steps discussed above. A particularly simple approach 
is to solve top-down (coarse to fine) and update bottom-up. In this case the solution to 
the coarsest MDP is pushed down to the previous scale, where we may solve again and 
push this solution downwards and so on, until reaching a solution to the bottom, original 
problem. It is helpful, though not essential, when solving top-down if the magnitudes of
coarse scale value functions are directly compatible with the optimal value function for the
fine-scale MDP. What is important, however, is that there is sufficient gradient to direct
the solution along the correct path to the goal(s), stepping from cluster to cluster at the
finest scale. In Algorithm 2, solving top-down will enforce the coarse scale value gradient.
One can mitigate the possibility of errors at the coarse scale by compressing with respect to 
carefully chosen policies at the fine scale (see Section 3.3.1). However, to allow for recovery 
of the optimal policy in the case of imperfect coarse scale information, a bottom-up pass 
updating coarse scale information is generally necessary. Coarse scale information may be 
updated either by re-compressing or by means of other local updates we will describe below. 

Although we will consider in detail the solution of a two layer hierarchy consisting of
a fine scale problem and a single coarsened problem, these ideas may be readily extended
to hierarchies of arbitrary depth: what is important is the handling of pairs of successive
scales. The particular algorithm we describe chooses (localized) policy iteration for fine-scale
policy improvement, and local averaging for updating values at bottleneck states.
Algorithm 2 gives the basic steps comprising the solution process. The fine scale MDP is
first compressed with respect to one or more policies. In the absence of any specific and
reliable prior knowledge, a collection of cluster-specific stochastic policies, to be described
in Section 3.3.1, is suggested. This collection attempts to provide all of the coarse actions
an agent could possibly want to take involving the given cluster. These coarse actions
involve traversing a particular cluster towards each bottleneck along paths which vary in
their directedness, depending on the reward structure within the cluster. The algorithm
is local to clusters, however, so computing these policies is inexpensive. Moreover, if
policies defined on one or more clusters are given a priori, then those policies may be added to the
collection used to compress, providing additional actions from which an agent may choose
at the coarse scale. Solving the coarse MDP amounts to choosing the best actions from the
available pool.

Algorithm 2 Top-down solution of MDPs: Alternating interior-boundary approach.

Set the initial fine scale policy to random uniform if not otherwise given via transfer.

1. Compress the MDP using one or more policies.

2. Solve the coarse MDP using any algorithm, and save the resulting value function
$V_{\mathrm{coarse}}$.

3. Fix the value function $V_{\mathrm{fine}}$ of the fine MDP at bottleneck states $\mathcal{B}$ to $V_{\mathrm{coarse}}$.

4. Solve the local boundary value problems separately within each cluster to fill in the
rest of $V_{\mathrm{fine}}$, given the current fine scale policy.

5. Recover a fine scale policy $\pi$, defined on $\mathring{S} \times A$, on cluster interiors ($\mathring{S} := S \setminus \mathcal{B}$) from
the resulting $V_{\mathrm{fine}}$. For $s \in \mathring{S}$,

$$a^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P(s, a, s') \big( R(s, a, s') + \Gamma(s, a, s')\, V_{\mathrm{fine}}(s') \big) \qquad (11a)$$

$$\pi(s, \cdot) = \delta_{a^*(s)}(\cdot). \qquad (11b)$$

6. Blend in a regularized fashion with the previous global policy. For $s \in \mathring{S}$,

$$\pi_{\mathrm{new}}(s, \cdot) = \lambda\, \pi(s, \cdot) + (1 - \lambda)\, \pi_{\mathrm{old}}(s, \cdot), \qquad (12)$$

where $\lambda \in (0, 1]$ is a regularization parameter.

7. (Optional - Local Policy Iteration) Set $\pi_{\mathrm{old}} = \pi_{\mathrm{new}}$. Repeat from step 4 until
convergence criteria met.

8. Update the fine policy on bottleneck states by applying Equations (11)-(12) for
$s \in \mathcal{B}$.

9. Update the boundary states' values either exactly, or by repeated local averaging,

$$V_{\mathrm{fine}}(s) \leftarrow E_{a \sim \pi_{\mathrm{new}}(s)} \Big[ \sum_{s'} P(s, a, s') \big( R(s, a, s') + \Gamma(s, a, s')\, V_{\mathrm{fine}}(s') \big) \Big], \qquad s \in \mathcal{B},$$

where the number of averaging passes N for each bottleneck state $s \in \mathcal{B}$ must exceed a
threshold given by a logarithm to base $\bar{\gamma}$, with $\bar{\gamma} := \max_{s,a,s'} \{ \Gamma(s, a, s')\, \mathbb{1}_{[P(s,a,s') > 0]} \}$.

10. Set $\pi_{\mathrm{old}} = \pi_{\mathrm{new}}$. Repeat from step 4 until convergence criteria met.

The next step of Algorithm 2 is to solve the coarse MDP to convergence. Since the coarse
MDP may itself be compressed and solved efficiently, this step is relatively inexpensive. The
optimal value function for the coarse problem is then assigned to the set of bottleneck states
for the fine problem (see footnote 7). With bottleneck values fixed, policy iteration is invoked within each
cluster independently (Steps 4-7). These local policy iterations may be applied to the
clusters in any order, and any number of times. The value determination step here can be
thought of as a boundary value problem, in which a cluster's boundary (bottleneck) values
are interpolated over the interior of the cluster. Section 3.3.2 explains how to solve these
problems as required by Step 4 of Algorithm 2. Note that only values on the interior of a
cluster are needed so the policy does not need to be specified on bottlenecks for local policy
iteration.
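The boundary value interpretation can be illustrated with a short Python sketch. It is not the method of Section 3.3.2; it simply applies Gauss-Seidel sweeps of the fixed-policy Bellman equation on interior states while the bottleneck values stay fixed. The function name and the toy corridor are hypothetical, and the policy is assumed to be already folded into the transition matrix `P`.

```python
def interpolate_interior(P, R, G, V, interior, n_sweeps=2000):
    """Sweep V(s) = sum_u P(s,u) * (R(s,u) + G(s,u) * V(u)) over interior
    states only; entries of V at bottleneck states are never modified."""
    V = list(V)
    for _ in range(n_sweeps):
        for s in interior:
            V[s] = sum(p * (R[s][u] + G[s][u] * V[u])
                       for u, p in enumerate(P[s]) if p > 0)
    return V

# Hypothetical 5-state corridor with bottlenecks 0 and 4.
P = [[0, 1, 0, 0, 0],
     [0.5, 0, 0.5, 0, 0],
     [0, 0.5, 0, 0.5, 0],
     [0, 0, 0.5, 0, 0.5],
     [0, 0, 0, 1, 0]]
zero = [[0.0] * 5 for _ in range(5)]
one = [[1.0] * 5 for _ in range(5)]
V_boundary = [0.0, 0.0, 0.0, 0.0, 10.0]     # values fixed at states 0 and 4
V = interpolate_interior(P, zero, one, V_boundary, interior=[1, 2, 3])
```

With zero rewards and no discounting the interior values are harmonic, so the result is the linear interpolation 0, 2.5, 5, 7.5, 10 between the two boundary values.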

Next, a greedy fine scale policy on a cluster's interior states is computed from the
interior values (Step 5). The new interior policy is a blend between the greedy policy and
the previous policy (Step 6). Starting from an initial stochastic fine scale policy, policy
blending allows one to regularize the solution and maintain a degree of stochasticity that
can repair coarse scale errors.

Finally, information is exchanged between clusters by updating the policy on bottleneck
states (Step 8), and then using this (globally defined) policy in combination with the
interior values to update bottleneck values by local averaging (Step 9). Both of these steps
are computationally inexpensive. Alternating updates to cluster interiors and boundaries
are executed until convergence. This algorithm is guaranteed to converge to an optimal value
function/policy pair (it is a variant of modified asynchronous policy iteration, see Bertsekas
(2007)); however, in general convergence may not be monotonic (in any norm). Section 3.3.4
gives a proof of convergence for arbitrary initial conditions.
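The alternation between cluster interiors and bottlenecks can be sketched in Python as follows. This is a simplified illustration, not the full procedure: it skips compression and the coarse solve and instead starts from a given (deliberately wrong) bottleneck value, uses deterministic policies with no blending, assumes a uniform discount, and evaluates policies by Gauss-Seidel sweeps. The data layout `P[s][a] = [(prob, next_state, reward), ...]` and all names are hypothetical.

```python
def q_value(P, V, gamma, s, a):
    """sum over s' of P(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def greedy(P, V, gamma, s):
    return max(range(len(P[s])), key=lambda a: q_value(P, V, gamma, s, a))

def solve_cluster(P, V, policy, gamma, interior, sweeps=200, max_improvements=100):
    """Local policy iteration on one cluster with bottleneck values held fixed."""
    for _ in range(max_improvements):
        for _ in range(sweeps):                     # evaluate policy on the interior
            for s in interior:
                V[s] = q_value(P, V, gamma, s, policy[s])
        improved = {s: greedy(P, V, gamma, s) for s in interior}
        if all(improved[s] == policy[s] for s in interior):
            return
        policy.update(improved)                     # no blending in this sketch

def alternating_solve(P, gamma, clusters, bottlenecks, V, policy,
                      rounds=50, n_avg=5, tol=1e-9):
    for _ in range(rounds):
        V_prev = list(V)
        for interior in clusters:                   # interiors, cluster by cluster
            solve_cluster(P, V, policy, gamma, interior)
        for b in bottlenecks:                       # greedy policy on bottlenecks
            policy[b] = greedy(P, V, gamma, b)
        for _ in range(n_avg):                      # local averaging on bottlenecks
            for b in bottlenecks:
                V[b] = q_value(P, V, gamma, b, policy[b])
        if max(abs(x - y) for x, y in zip(V, V_prev)) < tol:
            break
    return V, policy

# Hypothetical corridor 0-1-2-3-4: state 4 is a terminal goal with value 0,
# state 2 is the only bottleneck, and every move costs 1.
LEFT, RIGHT = 0, 1
P = {s: [[(1.0, max(s - 1, 0), -1.0)], [(1.0, s + 1, -1.0)]] for s in range(4)}
V0 = [0.0, 0.0, -5.0, 0.0, 0.0]                     # wrong initial bottleneck value
policy0 = {s: LEFT for s in range(4)}
V, policy = alternating_solve(P, 0.9, clusters=[[0, 1], [3]], bottlenecks=[2],
                              V=V0, policy=policy0)
```

In this example the bottleneck value is corrected from −5 to −1.9 after the first round, and the interior values settle at the optimal values −3.439, −2.71 and −1 in the second, with every state moving right.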

We note that often approximate solutions to the top level or cluster-specific problems
are sufficient. Empirically we have found that single policy iterations applied to the clusters
in between bottleneck updates give rapid convergence (see Section 5). We emphasize that
at each level of the hierarchy below the topmost level, the MDP may be decomposed into 
distinct pieces to be solved locally and independently of each other. Obtaining solutions at 
each scale is an efficient process and at no point do we solve a large, global problem. 

In practice, the multi-scale algorithm we have discussed requires fewer iterations to 
converge than global, single-scale algorithms, for primarily two reasons. First, the multiscale 
algorithm starts with a coarse approximation of the fine solution given by the solution 
to the compressed MDP. This provides a good global warm start that would otherwise 
require many iterations of a global, single-scale algorithm. Second, the multiscale treatment 
can offer faster convergence since sub-problems are decoupled from each other given the 
bottlenecks. Convergence of local (within cluster) policy iteration is thus constrained by 
what are likely to be faster mixing times within clusters, rather than slow global times 
across clusters, as conductances are comparatively large within clusters by construction. 

3.3.1 Selecting Policies for Compression 

In the context of solving an MDP hierarchy, the ideal coarse value function is one which
takes on the exact values of the optimal fine value function at the bottlenecks. Such a
value function clearly respects the fine scale gradient, falls within a compatible dynamic
range, and can be expected to lead to rapid convergence when used in conjunction with
Algorithm 2. Indeed, the best possible coarse value function that can be realized is precisely
the solution to a coarse MDP compressed with respect to the optimal fine scale policy. We
propose a local method for selecting a collection of policies at the fine scale that can be used
for compression, such that the solution to the resulting coarse MDP is likely to be close to
the best possible coarse solution.

7. Recall that the statespace of the coarse problem is exactly the set of bottlenecks for the fine problem.

Algorithm 3 Determining good policies to be used for compression.

for each cluster c do
    Set $S_c$ to be cluster c
    for each bottleneck $b \in \partial c$ do
        Set $P_{c,b}$ to be $P_c$ modified so that b is absorbing
        Set $R_{c,b,r} \leftarrow R_c$
        for each $r \in R_{\mathrm{int}_c}$ do
            $R_{c,b,r}(s, a, b) \leftarrow R_c(s, a, b) + \Gamma_c(s, a, b)\, r$ for all $s \in c$, $a \in A$
            Solve $\mathrm{MDP}_{c,b,r} := (S_c, A, P_{c,b}, R_{c,b,r}, \Gamma_c)$ for a policy $\pi_{c,b,r}$ on cluster c
        end for
    end for
end for

Algorithm 3 summarizes the proposed policy selection method, and is local in that only
a single cluster at a time is considered. The idea behind this algorithm is that useful 
coarse actions involving a given cluster generally consist of getting from one of the cluster's 
bottlenecks to another. The best fine scale path from one bottleneck to another, however, 
depends on the reward structure within the cluster. In fact, if there are larger rewards 
within the cluster than outside, it may not even be advantageous to leave it. On the other 
hand, if only local rewards within a cluster are visible, then we cannot tell whether locally 
large rewards are also globally large. Thus, a collection of policies covering all the interesting 
alternatives is desired. 

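For cluster c, let $\bar{r} := \max_{s \in c, a, s' \in c} R_c(s,a,s')$, $\underline{r} := \min_{s \in c, a, s' \in c} R_c(s,a,s')$, and $\bar\gamma := \max_{s \in c, a, s' \in c} \Gamma_c(s,a,s')$. Let diam(c) denote the longest graph geodesic between any two states in cluster c. Then for each bottleneck $b \in \mathcal{B} \cap c$, and any choice of $r \in R_{\mathrm{int}_c}$, where

$$R_{\mathrm{int}_c} = \frac{1 - \bar\gamma^{\mathrm{diam}(c)}}{1 - \bar\gamma}\,\bigl[\min\{0, \underline{r}\}, \max\{0, \bar{r}\}\bigr],$$

we consider the following $\mathrm{MDP}_{c,b,r} := (S_c, A, P_{c,b}, R_{c,b,r}, \Gamma_c)$: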

• The statespace $S_c$ is c;

• The transition probability law $P_{c,b}$ is the transition law of the original MDP restricted to c, but modified so that b is an absorbing state$^{8}$ for $P_{c,b}^{\pi}$ regardless of the policy $\pi$;

• The rewards $R_{c,b,r}$, for fixed b, are the rewards R of the original MDP truncated to c, with the modification $R_{c,b,r}(s,a,b) = R(s,a,b) + \Gamma(s,a,b)\,r$, for all $s \in c$ and $a \in A$;

• The discount factors $\Gamma_c$ are the discount factors of the original MDP truncated to c.

8. If the cluster already contains absorbing (terminal) states, then those states remain absorbing (in addition to b).

The optimal (or approximate) policy $\pi^*_{c,b,r}$ of each $\mathrm{MDP}_{c,b,r}$ is computed. As r ranges in a continuous interval, we expect to find only a small number$^{9}$ of distinct optimal policies $\{\pi^*_{c,b,r}\}_{r \in R_{\mathrm{disc}_{c,b}}}$ for each fixed b, where $R_{\mathrm{disc}_{c,b}}$ is the set of corresponding rewards placed at b giving rise to the distinct policies. Therefore the cardinality of this set of policies is $O(\sum_{b \in \partial c} |R_{\mathrm{disc}_{c,b}}|)$. The set of policies $\boldsymbol{\pi}_c := \bigcup_{b \in \partial c} \{\pi^*_{c,b,r}\}_{r \in R_{\mathrm{disc}_{c,b}}}$ is our candidate for the set of actions available at the coarser scale when the agent is at a bottleneck adjacent to cluster c, and for the set of actions which was assumed and used in Sections 3.1 and 3.2.

Finally, we note that Algorithm 3 is trivially parallel, across clusters and across bottlenecks within clusters. In addition, solving for each policy is comparatively inexpensive because it involves a single cluster.
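The policy selection loop for a single cluster can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the cluster is stored as dense arrays `P[s, a, s']`, `R[s, a, s']`, `G[s, a, s']` (transition, reward and discount tensors), each local MDP is solved with plain value iteration, the absorbing bottleneck is assumed to accrue no further reward, and the interval of rewards r is sampled on a user-supplied grid rather than searched by bisection. The function names and the three-state toy cluster are hypothetical.

```python
import numpy as np

def solve_mdp(P, R, G, tol=1e-10, max_iter=100_000):
    """Value iteration with transition-dependent discounts G[s, a, s']."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        Q = np.einsum("ijk,ijk->ij", P, R + G * V[None, None, :])
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = np.einsum("ijk,ijk->ij", P, R + G * V[None, None, :])
    return tuple(int(a) for a in Q.argmax(axis=1)), V

def compression_policies(P, R, G, bottlenecks, r_values):
    """For each bottleneck b, collect the distinct greedy policies found
    as the reward r placed at b sweeps over r_values."""
    policies = {}
    for b in bottlenecks:
        P_b = P.copy()
        P_b[b, :, :] = 0.0
        P_b[b, :, b] = 1.0                   # make b absorbing
        found = set()
        for r in r_values:
            R_br = R.copy()
            R_br[:, :, b] += G[:, :, b] * r  # R(s,a,b) + Gamma(s,a,b) r
            R_br[b, :, :] = 0.0              # assumption: no reward once absorbed
            policy, _ = solve_mdp(P_b, R_br, G)
            found.add(policy)
        policies[b] = found
    return policies

# Toy cluster: states 0 and 1 are interior, state 2 is the bottleneck.
P = np.zeros((3, 2, 3))
R = np.zeros((3, 2, 3))
G = np.full((3, 2, 3), 0.9)
P[0, 0, 0] = 1.0; R[0, 0, 0] = 1.0   # state 0, action 0: stay, local reward 1
P[0, 1, 1] = 1.0                     # state 0, action 1: move to state 1
P[1, 0, 0] = 1.0                     # state 1, action 0: move back to state 0
P[1, 1, 2] = 1.0                     # state 1, action 1: enter the bottleneck
P[2, :, 2] = 1.0
policies = compression_policies(P, R, G, bottlenecks=[2],
                                r_values=[0.0, 11.0, 100.0])
```

In this toy cluster a small r leaves the agent collecting the local reward, an intermediate r makes only the state next to the bottleneck exit, and a large r pulls every state toward the bottleneck, so the sweep yields three distinct policies.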

3.3.2 Solving Localized Sub-Problems 

Given a solution (possibly approximate) to a coarse MDP in the form of a value function
$V_{\text{coarse}}$, one can efficiently compute a solution to the fine-scale MDP by fixing the values at
the fine scale's bottlenecks to those given by the coarse MDP value function. The problem
is partitioned into clusters where we can solve locally for a value function or policy within 
each cluster independently of the other clusters, using a variety of MDP solvers. Values 
inside a cluster are conditionally independent of values outside the cluster given the cluster's 
bottleneck values. 

As an illustrative example we show how policy iteration may be applied to learn the 
policies for each cluster. Let $\pi_c$ be an initial policy at the fine scale defined on at least $\mathring{c}$.
Determination of the values on $\mathring{c}$ given values on $\partial c$ amounts to solving a boundary value
problem: a continuous-domain physical analogy to this situation is that of solving a Poisson
boundary value problem with Dirichlet boundary conditions. The connection with Poisson
problems is that if $P_c^{\pi_c}$ is the transition matrix of the Markov chain $(X_n)_{n \ge 0}$ following $\pi_c$
in cluster c, then we would like to compute the function

$$V(s) := E\left[R_0^{\tau_0} + \Delta_0^{\tau_0} V_{\text{coarse}}(X_{\tau_0}) \mid X_0 = s\right], \quad s \in c,$$

where $\tau_0 := \inf\{n > 0 \mid X_n \in \partial c\}$ is the first passage time of the boundary of cluster c, and
$R_0^{\tau_0}$, $\Delta_0^{\tau_0}$ are respectively defined in Sections 3.2.4 and 3.2.6. It can be shown that $V(s)$
is unique and bounded under our usual assumption that the boundary $\partial c$ be $\pi_c$-reachable
from any interior state $s \in \mathring{c}$. The value function we seek is computed by solving a linear
system similar to Equation Q. We have. 



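$$V(s) = \begin{cases} V_{\text{coarse}}(s) & \text{if } s \in \partial c \\ \sum_{s' \in c,\, a \in A} P_c(s,a,s')\,\pi_c(s,a)\,\bigl[R(s,a,s') + \Gamma(s,a,s')\,V(s')\bigr] & \text{if } s \in \mathring{c} \end{cases}$$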



where $P_c$ is the restriction of P to c defined by Equation ([s]). It is instructive to consider a matrix-vector formulation of this system. Let $R_c$, $\Gamma_c$ denote the respective truncation of $R(s,a,s')$, $\Gamma(s,a,s')$ to the triples $\{(s,a,s') \mid s \in c, s' \in c, a \in A\}$. Defining the quantities $(P_c \circ R_c)^{\pi_c}$, $(P_c \circ \Gamma_c)^{\pi_c}$ using (2), assume the partitioning

$$(P_c \circ R_c)^{\pi_c} = \begin{pmatrix} B_r & C_r \\ D_r & Q_r \end{pmatrix}, \qquad (P_c \circ \Gamma_c)^{\pi_c} = \begin{pmatrix} B & C \\ D & Q \end{pmatrix}$$

where interactions among bottlenecks attached to cluster c are captured by B labeled blocks and interactions among non-bottleneck interior states by Q labeled blocks. Fix $V_{\mathcal{B}} = V_{\text{coarse}}$, and let $V_Q$ denote the (unknown) values of the cluster's interior states. The value function on interior states $V_Q$ is given by

$$V_Q = [D_r \;\; Q_r]\,\mathbf{1} + [D \;\; Q] \begin{pmatrix} V_{\mathcal{B}} \\ V_Q \end{pmatrix}$$

so that we must solve the $|\mathring{c}| \times |\mathring{c}|$ linear system

$$(I - Q)\,V_Q = [D_r \;\; Q_r]\,\mathbf{1} + D\,V_{\mathcal{B}}. \qquad (13)$$

9. which may be found by bisection search of Rc

Given a value function for a cluster, the policy update step is unchanged from vanilla 
policy iteration except that we do not solve for the policy at bottleneck states: only the 
policy restricted to interior states is needed to update the Q and D blocks of the matrices 
above, towards carrying out another iteration of value determination inside the cluster 
(the blocks B, C are not needed). This shows in yet another way that policy iteration
within a given cluster is completely independent of other clusters. When policy iteration 
has converged to the desired tolerance within each cluster independently, or if the desired 
number of iterations has been reached, the individual cluster-specific value functions may 
be simply concatenated together along with the given values at the bottlenecks to obtain a 
globally defined value function. 
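As a small numerical illustration of the value determination step in Equation (13), the sketch below solves for the interior values of one cluster given fixed bottleneck values. It assumes dense NumPy arrays for the blocks D, Q, D_r, Q_r under a fixed policy; the function name `solve_interior` and the three-state chain are hypothetical and chosen only so the result can be checked by hand.

```python
import numpy as np

def solve_interior(D, Q, D_r, Q_r, V_B):
    """Solve (I - Q) V_Q = [D_r Q_r] 1 + D V_B for the interior values V_Q.

    D, Q     : discounted transition blocks (interior -> bottleneck / interior)
    D_r, Q_r : matching blocks of the elementwise product of P and R
    V_B      : fixed bottleneck values taken from the coarse solution
    """
    rhs = D_r.sum(axis=1) + Q_r.sum(axis=1) + D @ V_B
    return np.linalg.solve(np.eye(Q.shape[0]) - Q, rhs)

# Chain: interior 0 -> interior 1 -> bottleneck, discount 0.9, reward -1 per step.
Q = np.array([[0.0, 0.9],
              [0.0, 0.0]])
D = np.array([[0.0],
              [0.9]])
Q_r = np.array([[0.0, -1.0],
                [0.0, 0.0]])
D_r = np.array([[0.0],
                [-1.0]])
V_B = np.array([10.0])            # bottleneck value fixed by the coarse MDP

V_Q = solve_interior(D, Q, D_r, Q_r, V_B)
```

Here the state next to the bottleneck has value -1 + 0.9 * 10 = 8, and the state one step further away has value -1 + 0.9 * 8 = 6.2. Only blocks belonging to this cluster enter the computation, which is the sense in which clusters are solved independently.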

As mentioned above, solving for a value function on a cluster's interior does not require 
the initial policy to be defined on bottlenecks$^{10}$; however, a policy on bottleneck states
can be quickly determined from the global value function. This step is computationally 
inexpensive when bottlenecks are few in number and have comparatively low out-degree. 
A policy defined on cluster interiors is obtained either from the global value function, or 
automatically during the solution process if, for example, a policy-iteration variant is used. 

3.3.3 Bottleneck Updates 

Given any value function V on cluster interiors and any globally defined policy $\pi$, values at
bottleneck states may be updated using similar asynchronous iterative algorithms: we hold
the value function fixed on all cluster interiors, and update the bottlenecks. Combined with
interior updating, this step comprises the second half of the alternating solution approach
outlined in Algorithm 2.

Local averaging of the values in the vicinity of a bottleneck is a particularly simple update,

$$V(s) \leftarrow \sum_{s', a} P(s,a,s')\,\pi(s,a)\,\bigl(R(s,a,s') + \Gamma(s,a,s')\,V(s')\bigr), \quad s \in \mathcal{B}.$$

10. We will discuss why this situation could arise in the context of transfer learning (Section 4).

Value iteration and modified value iteration variants may be defined analogously. Value determination at the bottlenecks may be characterized as follows. Consider the (global) quantities $(P \circ R)^{\pi}$, $(\Gamma \circ P)^{\pi}$ (these do not need to be computed in their entirety), and the partitioning

$$(P \circ R)^{\pi} = \begin{pmatrix} B_r & C_r \\ D_r & Q_r \end{pmatrix}, \qquad (\Gamma \circ P)^{\pi} = \begin{pmatrix} B & C \\ D & Q \end{pmatrix}$$

where, as before, interactions among bottlenecks are captured by B labeled blocks and interactions among non-bottleneck (interior) states by Q labeled blocks. Let $V_Q$ be held fixed to the known interior values, and let $V_{\mathcal{B}}$ denote the unknown values to be assigned to bottlenecks. Then $V_{\mathcal{B}}$ is obtained by solving the $|\mathcal{B}| \times |\mathcal{B}|$ linear system

$$(I - B)\,V_{\mathcal{B}} = [B_r \;\; C_r]\,\mathbf{1} + C\,V_Q. \qquad (14)$$

In the ideal situation, by virtue of the spectral clustering Algorithm 1, the blocks $C$ and
$C_r$ are likely to be sparse (bottlenecks should have low out-degree) so the matrix-vector
products $C V_Q$ and $C_r \mathbf{1}$ are inexpensive. Furthermore, even though $|\mathcal{B}|$ is already small,
by similar arguments bottlenecks should not ordinarily have many direct connections to
other bottlenecks, and B is likely to be block diagonal. Thus, the cost of solving (14) is likely to be
essentially negligible.
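The relationship between repeated local averaging at the bottlenecks and the exact solve of Equation (14) can be checked numerically. The sketch below is an assumption-laden illustration: it uses synchronous sweeps (rather than the asynchronous updates discussed above) for simplicity, dense NumPy blocks, and made-up numbers for a problem with two bottlenecks and two interior states. The function names are hypothetical.

```python
import numpy as np

def bottleneck_rhs(B_r, C_r, C, V_Q):
    """Right-hand side of (14): [B_r C_r] 1 + C V_Q, with V_Q held fixed."""
    return B_r.sum(axis=1) + C_r.sum(axis=1) + C @ V_Q

def solve_bottlenecks(B, C, B_r, C_r, V_Q):
    """Exact value determination at the bottlenecks via (14)."""
    rhs = bottleneck_rhs(B_r, C_r, C, V_Q)
    return np.linalg.solve(np.eye(B.shape[0]) - B, rhs)

def sweep_bottlenecks(B, C, B_r, C_r, V_Q, V_B0, n_sweeps):
    """n_sweeps synchronous local-averaging updates of the bottleneck values."""
    rhs = bottleneck_rhs(B_r, C_r, C, V_Q)
    V_B = np.array(V_B0, dtype=float)
    for _ in range(n_sweeps):
        V_B = rhs + B @ V_B
    return V_B

# Two bottlenecks that mostly lead into the interior and rarely to each other.
B = np.array([[0.00, 0.09],
              [0.09, 0.00]])
C = np.array([[0.81, 0.00],
              [0.00, 0.81]])
B_r = np.zeros((2, 2))
C_r = np.array([[0.9, 0.0],
                [0.0, 0.0]])
V_Q = np.array([6.2, 8.0])

V_exact = solve_bottlenecks(B, C, B_r, C_r, V_Q)
V_swept = sweep_bottlenecks(B, C, B_r, C_r, V_Q, np.zeros(2), n_sweeps=50)
```

Because the bottleneck-to-bottleneck block B is small in this example, a handful of averaging sweeps already agrees with the exact solution to high precision, which is consistent with the remark that bottleneck updates are cheap when B is sparse.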

3.3.4 Proof of Convergence 



Algorithm 2 is an instance of modified asynchronous policy iteration (see (Bertsekas, 2007)
for an overview), and can be shown to recover an optimal fine scale policy from any initial
starting point.

Theorem 5 Fix any initial fine-scale policy $\pi_0$, and any collection of compression policies $\{\boldsymbol{\pi}_c\}_{c \in C}$ such that for each $c \in C$, $\partial c$ is $\pi_c$-reachable for all $\pi_c \in \boldsymbol{\pi}_c$. Let $V^k$ denote the global fine scale value function after $k > 0$ passes of the alternating update steps of Algorithm 2. For an appropriate number of updates N per bottleneck per algorithm iteration satisfying

$$N > \log_{\bar\gamma} \tfrac{1}{2} \qquad (15)$$

with $\bar\gamma := \max_{s,a,s'}\{\Gamma(s,a,s')\,\mathbf{1}_{[P(s,a,s') > 0]}\}$, the sequence $(V^k)_k$ generated by the alternating interior-boundary policy iteration Algorithm 2 satisfies

$$\lim_{k \to \infty} \max_s |V^*(s) - V^k(s)| = 0$$

where $V^*$ is the unique optimal value function.

Proof We first note that the value function updates in Algorithm 2 on both interior and boundary states at the fine scale may be written as one or more applications of (locally defined) averaging operators $T_\pi$ of the form

$$(T_\pi V)(s) = \sum_{s', a} \pi(s,a)\,P(s,a,s')\,\bigl(R(s,a,s') + \Gamma(s,a,s')\,V(s')\bigr). \qquad (16)$$

Value determination is equivalent to an "infinite" number of applications. The main challenge is that we require convergence to optimality from any initial condition $(V^0, \pi_0)$. Under the current assumptions on the policy, modified asynchronous policy iteration is known to converge (monotonically) in norm to a unique optimal $V^*$, with corresponding optimal $\pi^*$, provided the initial pair $(V^0, \pi_0)$ satisfies $T_{\pi_0} V^0 - V^0 \ge 0$ (Bertsekas, 2007), where $T_\pi$ is the DP mapping defined by (16). This initial condition is not in general satisfied here, since $(V^0, \pi_0)$ may be set on the basis of transferred information and/or coarse scale solutions. In Algorithm 2 for instance, the initial value function is the initial coarse MDP solution on the bottlenecks, and zero everywhere else. Furthermore, a common fix that shifts $V^0$ by a large negative constant (depending on $(1 - \bar\gamma)^{-1}$) does not apply because it could destroy consistency across sub-problems, and moreover can make convergence extraordinarily slow.

Alternatively, Williams & Baird show that modified asynchronous policy iteration will converge to optimality from any initial condition, provided enough value updates $T_\pi$ are applied (Williams and Baird, 1990, 1993). The condition (15) adapts the precise requirement found in (Williams and Baird, 1990) to the present setting, where discount factors are state and action dependent. The proof follows (Williams and Baird, 1993, Thm. 4.2.10) closely, so we only highlight points where differences arise due to this state/action dependence, and due to the use of multi-step operators, $T_\pi^N$. We direct the reader to the references above for further details.

First note that if $(s_n)_n$ is the Markov chain with transition law $P^\pi$,

$$(T_\pi^N V)(s) = E\left[R_0^N + \Delta_0^N V(s_N) \mid s_0 = s\right]$$

where $R_0^N$, $\Delta_0^N$ are as defined in Sections 3.2.4 and 3.2.6 above. One can see this by defining a recursion $V_n = T_\pi V_{n-1}$ with $V_0 = V$, and repeated substitution of (16). Fix $\epsilon > 0$ and choose N large enough so that

$$\liminf_{k \to \infty} V^k(s) - \epsilon < V^k(s) < \limsup_{k \to \infty} V^k(s) + \epsilon$$

for all $k > N$ and all s. Let $L^* := \max_s \{\limsup_{k \to \infty} V^k(s) - \liminf_{k \to \infty} V^k(s)\}$, and let $s^*$ be any state at which the maximum $L^*$ is achieved. It is enough to show that $L^* = 0$ (convergence of the sequence $V^k$) to ensure convergence to optimality (Williams and Baird, 1993, Thm. 4.2.1), however we note that this convergence need not be monotonic in any
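norm. The action of $T_\pi^N$ after N iterations can be bounded as follows:

$$\sup_{k > N} (T_\pi^N V^k)(s^*) - \inf_{k > N} (T_\pi^N V^k)(s^*) \le E_{s^*}\!\left[\Delta_0^N \Bigl(\limsup_{k \to \infty} V^k(s_N) + \epsilon\Bigr)\right] - E_{s^*}\!\left[\Delta_0^N \Bigl(\liminf_{k \to \infty} V^k(s_N) - \epsilon\Bigr)\right] \le 2\epsilon\,E_{s^*}[\Delta_0^N] + L^*\,E_{s^*}[\Delta_0^N].$$

Since $E_{s^*}[\Delta_0^N] \le \bar\gamma^N$, the right-hand side is at most $2\epsilon\bar\gamma^N + \bar\gamma^N L^*$. Following the reasoning in (Williams and Baird, 1993, Thm. 4.2.10, pg. 27), subsequent policy improvement at state $s^*$ can at most double the length, S, of the interval bounded above. Hence, $L^* \le 2S \le 2(2\epsilon\bar\gamma^N + \bar\gamma^N L^*)$, so that

$$L^* \le \frac{4\epsilon\,\bar\gamma^N}{1 - 2\bar\gamma^N}$$

for any $\epsilon > 0$, giving that $L^* = 0$ as long as

$$\bar\gamma^N < 1/2.$$

For the interior states, the condition $\bar\gamma^N < 1/2$ is clearly satisfied, since we perform value determination at those states in Algorithm 2. ■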



3.4 Complexity Analysis 

We discuss the running time complexity of each of the three steps discussed in the introduc- 
tion of this section: partitioning, compression, and solving an MMDP. For the latter two 
steps, the computational burden is always limited to at most a single cluster at a time. In 
all cases, we consider worst case analyses in that we do not assume sparsity, preconditioning 
or parallelization, although there are ample, low-hanging opportunities to leverage all three. 



Partitioning: The complexity of the statespace partitioning and bottleneck identification step depends in general on the algorithm used. The local clustering algorithm of Spielman and Teng (Spielman and Teng, 2008) or Peres and Andersen (Andersen and Peres, 2009) finds an approximate cut in time "nearly" linear in the number of edges. The complexity of Algorithm 1 above is dominated by the cost of finding the stationary distribution of $P_{\text{tel}}$ and of finding a small number of eigenvectors of the directed graph Laplacian $\mathcal{L}$. The first iteration is the most expensive, since the computations involve the full statespace. However, the invariant distribution and eigenvectors may be computed inexpensively. P is typically sparse, so $P_{\text{tel}}$ is the sum of a sparse matrix and a rank-1 matrix, and obtaining an exact solution can be expected to cost far less than that of solving a dense linear system. Approximate algorithms are often used in practice, however. For example, the algorithm of (Chung and Zhao, 2010) computes a stationary distribution within an error bound of $\epsilon$ in time $O(|E| \log(1/\epsilon))$ if there are $|E|$ edges in the statespace graph given by P. The eigenvectors of $\mathcal{L}$ may also be computed efficiently using randomized approximate algorithms. The approach described in (Halko et al., 2011) computes k eigenvectors (to high precision with high probability) in $O(|S|^2 \log k)$ time, assuming no optimizations for sparse input. Finding eigenvectors for the subsequent sub-graphs may be accelerated substantially by preconditioning using the eigenvectors found at the previous clustering iteration.



Compression: As discussed above, compression of an MDP involves computations which only consider one cluster at a time. This makes the compression step local, and restricts time and space requirements. But assessing the complexity is complicated by the fact that non-negative solutions are needed when finding coarse transition probabilities and discounts. Various iterative algorithms for solving non-negative least-squares (NNLS) problems exist (Bjorck, 1996; Chen and Plemmons, 2010), however guarantees cannot generally be given as to how many iterations are required to reach convergence. The recent quasi-Newton algorithm of Kim et al. (Kim et al., 2010) appears to be competitive with or outperform many previously proposed methods, including the classic Lawson-Hanson method (Lawson and Hanson, 1974) embedded in MATLAB's lsqnonneg routine. We point out, however, that it is often the case in practice that the unique solutions to the unconstrained linear systems appearing in Propositions 1 and 4 are indeed also minimal, non-negative solutions. In
this case, the complexity is (9(|5c||cp + |c||5cp) per cluster, for finding the coarse transition 
probabilities and coarse discounts corresponding to a single fine scale policy. 

Solving for the coarse rewards always involves solving a linear system (without constraints),
since the rewards are not necessarily constrained to be non-negative. This step
also involves (9(|5c||cp + |c||5cp) time per cluster, per fine policy. 

We note briefly that these complexities follow from naive implementations; many improvements are possible. First of all, in many cases the graphs involved are sparse, and iterative methods for the solution of linear systems, for example, would take advantage of sparsity and dramatically reduce the computational costs; the complexities above do not reflect these savings. Moreover, solving for the transition probabilities involves solving for multiple ($|\partial c|$) right-hand sides, and the left-hand sides of the linear systems determining compressed rewards and discounts are the same. In addition, the calculation of the compressed quantities above is embarrassingly parallel both at the level of clusters as well as bottlenecks attached to each cluster (elements of $\partial c$). The case of compression with respect to multiple fine policies is also trivially parallelized.



MMDP Solution: As above, the complexity of solving an MMDP will depend on the algorithm selected to solve coarse MDPs and local sub-problems. Here we will consider solving with the exact (dynamic programming based) policy iteration algorithm, Algorithm 2. In the worst case, policy iteration can take $|S|$ iterations to solve an MDP with statespace S, but this is pathological. We will assume that the number of iterations required to solve a given MDP is much less than the size of the problem. This is not entirely unreasonable, since we can assume policies are regularized with a small amount of the diffusion policy, and moreover, if there is significant complexity at the coarse scale, then further compression steps should be applied until arriving at a simple coarse problem where it is unlikely that the worst-case number of iterations would be required. Similarly, solving for the optimal policy within clusters should take few iterations compared to the size of the cluster because, by construction of the bottleneck detection algorithm, the Markov chain is likely to be fast mixing within a cluster.

Assume the MDP has already been compressed, and consider a fine/coarse pair of successive scales. Given a coarse scale solution, the cost of solving the fine local boundary value problems is $O(\sum_{c \in C} |\mathring{c}|^3)$ (ignoring sparsity). Updating the policy everywhere on $\mathring{S}$ involves solving $|\mathring{S}|$ maximization problems, but these problems are also local because the cluster interiors partition $\mathring{S}$ by construction. The cost of updating the policy on $\mathring{S}$ is therefore the sum of the costs of locally updating the policy within each cluster's interior. The cost for each cluster $c \in C$ is $O(|A||\mathring{c}||c| + |A||\mathring{c}|)$ time to compute the right-hand side of Equation (11a) and search for the maximizing action. The cost of updating the policy and value function at bottleneck states is assumed to be negligible, since ordinarily $|\mathcal{B}| \ll |S|$. The cost of each iteration of Algorithm 2 is therefore dominated by the cost of solving the collection of boundary value problems.

The cost of solving an MMDP with more than two scales depends on just how "multiscale" the problem is. The number of possible scales, the size and number of clusters at each scale, and the number of bottlenecks at each scale, collectively determine the computational complexity. These are all strongly problem-dependent quantities, so to gain some understanding as to how these factors impact cost, we proceed by considering a reasonable scenario in which the problem exhibits good multiscale structure. For ease of notation, let n be the size of the original statespace. If at a scale j (with j = 0 the finest scale) there are $n_j$ states and $r_j$ clusters of roughly equal size, an iteration of Algorithm 2 at that scale has cost $O(r_j (n_j/r_j)^3)$. If the sizes of the clusters are roughly constant across scales, then we can say that $r_j = n_j/C$ for all j and some size C. If, in addition, the number of bottlenecks at each scale is about the number of clusters, then $n_j = n/C^j$ and the computation time across $\log n$ scales is $O(n \log n)$ per iteration. By contrast, global DP methods typically require $O(n^3)$ time per iteration. The assumption that there are $\log n$ scales corresponds to the assumption that we compress to the maximum number of possible levels, and each level has multiscale structure. If we adopt the assumption above that the number of iterations required to reach convergence at each scale is small relative to $n_j$, then the cost of solving the problem is $O(n \log n)$.
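The scenario above can be tabulated for a concrete problem size. The following sketch is purely illustrative and rests on the same idealized assumptions as the text (clusters of constant size C, about one bottleneck per cluster, dense cubic-cost solves within each cluster); the function name and the choice n = 1024, C = 4 are hypothetical.

```python
def multiscale_iteration_cost(n, C):
    """Per-scale cost r_j * (n_j / r_j)**3 under the idealized assumptions."""
    costs = []
    n_j = n
    while n_j > 1:
        r_j = max(n_j // C, 1)      # number of clusters at scale j
        size = n_j // r_j           # states per cluster (about C)
        costs.append(r_j * size ** 3)
        n_j = r_j                   # assumption: bottlenecks ~ clusters
    return costs

costs = multiscale_iteration_cost(1024, 4)   # one entry per scale, finest first
total = sum(costs)
global_cost = 1024 ** 3                      # single-scale dense solve, for comparison
```

With these numbers there are five scales and the per-iteration cost summed over all of them is 21824 operations, compared with roughly 10^9 for a single dense solve over the full statespace. Real problems will deviate from this idealization to the extent that cluster sizes and bottleneck counts vary.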

4. Transfer Learning 

Transfer learning possibilities within reinforcement learning domains have only relatively
recently begun to receive significant attention, and transfer remains a long-standing challenge with
the potential for substantial impact in learning more broadly. We define transfer here as
the process of transferring some aspect of a solution to one problem into another prob- 
lem, such that the second problem may be solved faster or better (a better policy) than 
would otherwise be the case. Faster may refer to either less exploration (samples) or fewer 
computations, or both. 

Depending on the degree and type of relatedness among a pair of problems, transfer may 
entail small or large improvements, and may take on several different forms. It is therefore 
important to be able to systematically: 

1. Identify transfer opportunities; 

2. Encode/represent the transferrable information; 

3. Incorporate transferred knowledge into new problems. 

We will argue that a novel form of systematic knowledge transfer between sufficiently 
related MDPs is made possible by the multiscale framework discussed above. In particular, 
if a learning problem can be decomposed into a hierarchy of distinct parts then there is 
hope that both a "meta policy" governing transitions between the parts, as well as the 
parts themselves, may be transferred when appropriate. In the former setting, one can 
imagine transferring among problems in which a sequence of tasks must be performed, but 
the particular tasks or their order may differ from problem to problem. The transfer of 
distinct sub-problems might for instance involve a database of pre-solved tasks. A new 
problem is solved by decomposing it into parts, identifying which parts are already in the 
database, and then stitching the pre-solved components together into a global policy. Any 
remaining unsolved parts may be solved for independently, and learning a meta policy on 
sub-tasks is comparatively inexpensive. 

A key conceptual distinction is the transfer of policies rather than value functions. Value functions reflect, in a sense, the global reward structure and transition dynamics specific to a given problem. These quantities may not easily translate or bear comparison from one task to another, while a policy may still be very much applicable and is easier to consider locally. Once a policy is transferred (Section 4.3) we may, however, assess goodness of the transfer (Section 4.5) by way of value functions computed with respect to the destination problem's transition probabilities and rewards. As transfer can occur at coarse scales or within a single partition element at the finest scale, conversion between policies and value functions is inexpensive. If there are multiple policies in a database we would like to test in a cluster, it is possible to quickly compute value functions and decide which of the candidate policies should be transferred.

If the transition dynamics governing a pair of (sub)tasks are similar (in a sense to be 
made precise later), then one can also consider transferring potential operators (defined in
Section 2.1). In this case the potential operator from a source problem is applied to the
reward function of a destination problem, but along a suitable pre-determined mapping 
between the respective statespaces and action sets. The potential operator approach also 
provides two advantages over value functions: reward structure independence and localiza- 
tion. The reward structure of the destination problem need not match that of the source 
problem. And a potential operator may be localized to a specific sub-problem at any scale, 
where locally the transition dynamics of the two problems are comparable, even if globally 
they aren't compatible or a comparison would be difficult. 

Both policy transfer and potential operator transfer provide a systematic means for 
identifying and transferring information where possible. At a high level, the transfer framework
we propose consists of the general steps given in Algorithm 4. Transfer between two
hierarchies proceeds by matching sub-problems at various scales, testing whether transfer 
can actually be expected to help, transferring policies and/or potential operators where 
appropriate, and finally solving the unsolved problem using the transferred information. 
Each of these steps is discussed in detail in the sections below. 



4.1 Notation and Assumptions 



In the following, MMDP(1), MMDP(2) will denote two distinct MDP hierarchies with underlying statespaces $S_1, S_2$ and action sets $A_1, A_2$, respectively. The notation $P_i, R_i, \Gamma_i$ for $i \in \{1,2\}$ will in this section refer to the respective transition, reward and discount tensors for problems MMDP(i), $i \in \{1,2\}$. To simplify the notation we will not explicitly attach cluster indices to these quantities, and assume an appropriate truncation/restriction (see Section 2.2) that will be clear from the context. The notation $c \in$ MMDP(i) indicates that a cluster c is a cluster at some scale of the hierarchy MMDP(i). As before, $\mathring{c}$, $\partial c$ denote the interior and boundary of the cluster c, respectively. For all objects, the scale in question will either be arbitrary or clear from the context. Unless otherwise noted, $\pi^*$ refers to the optimal policy for MMDP(1) at the appropriate scale. Throughout this section, we will assume for simplicity that optimal source problem policies $\pi^*$ are deterministic maps from states to actions. This assumption is not important for the main ideas discussed here, and is natural since transferred information is assumed to pass from a pre-solved source problem

Algorithm 4 High-level transfer learning steps.



Given a pair of problems (MMDP(1), MMDP(2)), and a solved multiscale MDP hierarchy for the first problem:

1. Compute a hierarchy of MDPs for the second, new problem as described in Section 3.

2. For each scale $j \ge 0$, select the sub-problems (clusters) in the destination hierarchy where transfer should be attempted.

3. Match the selected sub-problems in MMDP(2) to sub-problems in MMDP(1). (Section 4.2)

4. (Optional) Match the statespace graphs of paired sub-problems using a suitable graph-matching algorithm. This can be done globally at a given scale, or locally for each matched pair of clusters from Step (3). (Section 4.6)

5. Proceeding bottom-up, from the finest scale upwards:

   (a) If a pair of matched problems at the current scale is contained within a region of the previous, finer statespace where transfer has already occurred, remove the pair from the list of transfer possibilities.

   (b) For each remaining pair of matched sub-problems at the current scale, perform action correspondence and tentatively transfer the candidate sub-problem policies and/or potential operators from MMDP(1) to MMDP(2), along the statespace correspondences determined in Step (4). (Sections 4.3, 4.4)

   (c) Determine transferability of the policies and/or potential operators solving the first problem to the matched sub-problems in the second problem. (Section 4.5)

   (d) Retain only the transferred sub-problem policies/potential operators identified in Step (5c) to be transferrable.

6. Solve MMDP(2) with an algorithm such as Algorithm 2 (or other variants discussed in Section 3.3), starting from the transferred policies and potential operator derived values as initial policies and guesses for $V_{\text{coarse}}$ (respectively), within the appropriate clusters and scales.



to an unsolved or partially solved destination problem. In this case the policy for the solved source problem may be chosen deterministic$^{11}$.

At the coarsest scale in a hierarchy, there is only one "cluster" and there are no local bottlenecks. To see the coarsest scale as just a special case falling within a more general framework, we will treat the coarsest scale as a single cluster consisting of only interior states; the boundary will be the empty set. As will be explained below, partial transfer of a policy to a subset of the states in a cluster is possible, but since the coarsest scale usually involves a small statespace, full statespace graph matching between MMDP(1) and MMDP(2) should be inexpensive and potentially less error-prone. If this is the case, we may transfer into the entire scale, although the transfer will be seen as a transfer into the interior of a single cluster in order to fit within a common transfer framework.

11. In any event, if the optimal policy is non-deterministic, one can still consider taking $\pi^{\circ}(s) = \arg\max_a \pi^*(s,a)$ as the optimal policy for transfer purposes.

We set some further ground rules, since the range of possible transfer scenarios is large and diverse. We will restrict our attention to transferring policies and potential operators from a cluster $c_1 \in$ MMDP(1) at scale j belonging to a solved source problem, to a cluster $c_2 \in$ MMDP(2), also at scale j, belonging to an unsolved destination problem. We assume scales have been suitably matched, and do not address transfer between different scales. We will assume that if a matching $\eta : S_2 \supseteq S_2' \to S_1' \subseteq S_1$ between the statespaces of the source and destination problems is given (see Section 4.6 below), it is a bijection onto its image. We do not treat degenerate situations, such as $\eta(c_2) \cap c_1 = \emptyset$, and only consider transfer between subsets of $c_1$ and $c_2$ for which there is a given correspondence.

Finally, when considering cluster-to-cluster policy transfers, we will focus our attention 
on transfer to the interior of a cluster only - bottlenecks on a cluster's boundary will not 
receive a transferred policy. Unless prior knowledge regarding the matching between clusters 
is available, we do not recommend transferring a policy to bottlenecks. Bottleneck states 
typically play a pivotal role in transitions across clusters, and transfer errors at bottlenecks 
can slow down the solution process. Assessing transferability (Section 4.5) at bottleneck
states forming the boundary of a given cluster is also more involved because one has to 
decide how to keep the problem of determining transferability for one cluster separate from 
that of the other clusters. Instead, solutions on the bottlenecks at a given scale should 
ordinarily be computed jointly as the solution to the next higher (compressed) MDP, or one 
can pursue transfer of an entire coarse scale (all bottlenecks simultaneously) if possible. 

4.2 Cluster Correspondence 

A correspondence between clusters at a given scale is established by pairing clusters deemed
to be closest to each other in a suitable metric on graphs. A natural distance between graphs
is given by the average pairwise Euclidean distance between diffusion-map embeddings of
the underlying states. Let $G, G'$ be two directed, weighted statespace graphs corresponding
to a pair of clusters of size $|S|, |S'|$, and let $\{\xi_i\}_i, \{\xi'_j\}_j$ denote the respective collections of
diffusion map embeddings computed according to Section 3.1.1. Then we define

$$ d(G, G') := \frac{1}{|S|\,|S'|} \sum_{i,j} \big\| (\tau \circ \xi_i) - \xi'_j \big\| $$

where $\tau$ is the sign alignment vector defined by Equation ([s]). Given a pair of problems
(MMDP(1), MMDP(2)), a cluster $c_1$ in MMDP(1) is matched to the cluster

$$ c_2^* = \arg\min_{c_2 \in \mathrm{MMDP}(2)} d(G_{c_1}, G_{c_2}) $$

in MMDP(2), where $G_c$ denotes the weighted statespace subgraph corresponding to $c$ for
some choice of $\pi$ (e.g. $\pi^u$). We will only compare clusters occurring at similar scales.
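
The matching rule above can be illustrated with a minimal sketch. It assumes the diffusion-map embeddings have already been computed and are supplied as plain coordinate tuples of equal dimension; the function names `graph_distance` and `match_cluster`, and the optional `tau` argument standing in for the sign alignment vector, are hypothetical and not taken from the paper.

```python
import math

def graph_distance(emb_a, emb_b, tau=None):
    """Average pairwise Euclidean distance between two sets of embeddings.

    emb_a, emb_b: lists of equal-length coordinate tuples (assumed given).
    tau: optional per-coordinate signs applied to emb_a (assumed +1 if omitted).
    """
    if tau is None:
        tau = [1.0] * len(emb_a[0])
    total = 0.0
    for x in emb_a:
        aligned = [t * xi for t, xi in zip(tau, x)]
        for y in emb_b:
            total += math.sqrt(sum((a - b) ** 2 for a, b in zip(aligned, y)))
    return total / (len(emb_a) * len(emb_b))

def match_cluster(source_emb, candidate_embs):
    """Return the name of the candidate cluster closest to the source cluster."""
    return min(candidate_embs,
               key=lambda name: graph_distance(source_emb, candidate_embs[name]))

source = [(0.0, 0.0), (1.0, 0.0)]
candidates = {"near": [(0.0, 0.1), (1.0, 0.1)], "far": [(5.0, 5.0), (6.0, 5.0)]}
best = match_cluster(source, candidates)
```

In this toy input the source cluster is paired with the candidate whose embedded states lie nearby; in practice the comparison would be restricted to clusters at similar scales, as stated above.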

4.3 Policy Transfer 

Given a pair of matched clusters $c_1 \in$ MMDP(1), $c_2 \in$ MMDP(2), we describe how the
deterministic optimal policy $\pi^*$ from MMDP(1) can be transferred to some or all of $c_2$.

Figure 3: Policy transfer: action mapping. We set $\pi_{c_2}(w) = \arg\max_a P_2(w, a, w')$ if $w$ is matched
with $s$, and $w'$ is matched with $s' = \arg\max_{s''} P_1(s, \pi^*(s), s'')$.



Policy transfer may be carried out at any scale, and for subsets of a cluster's interior states. 
For example, one may also find that there are sub-tasks which are not exactly the same as 
solved tasks in a database, but do nevertheless bear a strong resemblance. In these cases 
one may pursue partial policy transfer possibilities. Transferring via policies provides a 
convenient way to incorporate both full and partial knowledge: a transferred piece of the 
policy can serve as an initial condition for computing a value function everywhere in the 
cluster. 

Assume we are given a bijective statespace mapping $\eta : S_2' \to S_1'$ such that $S_1' \subseteq S_1$,
$S_2' \subseteq S_2$, $c_2 \cap S_2' \neq \emptyset$ and $\eta(c_2 \cap S_2') \cap c_1 \neq \emptyset$. That is, we require that $\eta$ matches at
least part of $c_2$ with part of $c_1$. Statespace graph matching will be discussed in Section 4.6
below; here we will assume $\eta$ is either given or simply taken to be the identity
map. Next, let

$$ \mathcal{W}_\eta := \eta^{-1}\big(\eta(c_2 \cap \operatorname{dom} \eta) \cap c_1\big) \tag{17} $$

denote the subset of $c_2$ with a correspondence in $c_1$, and assume that $c_1, c_2$ are both
associated to scale $j$. An important aspect of policy transfer is the mapping of actions
along $\pi^*$ in MMDP(1) to MMDP(2). Figure 3 illustrates action mapping for an arbitrary
state $w \in c_2$. If $\eta(w) = s \in S_1$, we can follow $\pi^*$ by finding the state that $\pi^*(s)$ is trying to
transition to,

$$ s' = \arg\max_{s''} P_1(s, \pi^*(s), s''). $$

Then if $s' \in S_1$ corresponds back to $w' \in c_2$ via $w' = \eta^{-1}(s')$, the MMDP(2) action most
likely to induce a transition between $w$ and $w'$ is taken to be the transferred policy at $w$:

$$ a^* = \arg\max_a P_2(w, a, w'). $$

12. Recall that, by construction, any state at any level of an MDP hierarchy is a state from the finest scale.
Regardless of the scale at which the clusters $c_1, c_2$ occur, we may always consider subsets of $c_1, c_2$ as
subsets of the original underlying statespaces $S_1, S_2$ (resp.).

Algorithm 5 Policy transfer and subsequent solution process: high-level steps.

1. For each MMDP(2) state in $\mathcal{W}_\eta$, transfer the policy from the corresponding state in
MMDP(1) by mapping actions from MMDP(1) to MMDP(2).

2. For remaining states in $c_2$, fill in the policy with either the diffusion policy or a
previous guess, if available.

3. Apply Algorithm 2 (or its variants) to MMDP(2) using the policy on $c_2$ as an initial
condition.

4. Push down the resulting value function in $c_2$ to the previous scale, and continue
solving downwards, or stop and return the resulting policy on $c_2$ if $c_2$ is at the finest
scale ($j = 0$).

Once each $w \in \mathcal{W}_\eta$ has been assigned an action, the remaining missing policy entries
in $c_2$ can be set to either the uniform distribution or to a previous policy guess. Abusing
notation and using $\pi^u$ to denote either of the random uniform or previous policy,

$$ \pi_{c_2}(s) = \pi^u(s), \qquad s \in \{c_2 \setminus \mathcal{W}_\eta\}. $$
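
The action-mapping step can be sketched in a few lines. This is only an illustration under stated assumptions: transition probabilities are stored as nested dictionaries `P[state][action][next_state]`, the correspondence `eta` maps destination states to source states and is already restricted to the matched subsets, and a state whose target has no preimage simply keeps the fallback policy. The name `transfer_policy` and all argument names are hypothetical.

```python
def transfer_policy(c2_states, eta, pi_star, P1, P2, actions2, fallback):
    """Map a deterministic source policy onto destination cluster states.

    eta: dict destination state -> source state (assumed bijective onto its image).
    pi_star: dict source state -> action (deterministic optimal source policy).
    P1, P2: nested dicts P[s][a] = {next_state: probability}.
    fallback: dict destination state -> action used where no transfer is possible.
    """
    eta_inv = {s: w for w, s in eta.items()}
    policy = {}
    for w in c2_states:
        s = eta.get(w)
        if s is not None:
            # State the source policy is most likely trying to reach.
            next_probs = P1[s][pi_star[s]]
            s_next = max(next_probs, key=next_probs.get)
            w_next = eta_inv.get(s_next)
            if w_next is not None:
                # Destination action most likely to induce w -> w_next.
                policy[w] = max(actions2, key=lambda a: P2[w][a].get(w_next, 0.0))
                continue
        policy[w] = fallback[w]
    return policy

P1 = {"s0": {"go": {"s1": 0.9, "s0": 0.1}}, "s1": {"go": {"s1": 1.0}}}
P2 = {"w0": {"L": {"w0": 1.0}, "R": {"w1": 0.8, "w0": 0.2}},
      "w1": {"L": {"w0": 1.0}, "R": {"w1": 1.0}}}
policy = transfer_policy(["w0", "w1", "w2"], {"w0": "s0", "w1": "s1"},
                         {"s0": "go", "s1": "go"}, P1, P2, ["L", "R"],
                         {"w0": "L", "w1": "L", "w2": "L"})
```

Here the unmatched state `w2` keeps its fallback action, which plays the role of the uniform or previous policy guess.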

The resulting policy $\pi_{c_2}$ can then be used as a starting point for local policy iteration
in $c_2$ with e.g. Algorithm 2. A high-level summary of policy transfer and the subsequent
solution process is given in Algorithm 5. In particular, a value function $V_j$ everywhere in
$c_2$ is computed by solving the local boundary value problem

$$ V_j(s) = \begin{cases} \mathbb{E}_{a \sim \pi_{c_2}(s)} \Big[ \sum_{s' \in c_2} P_2(s, a, s') \big( R_2(s, a, s') + \Gamma_2(s, a, s') V_j(s') \big) \Big] & \text{if } s \in \mathring{c}_2, \\ V_{j+1}(s) & \text{if } s \in \partial c_2, \end{cases} \tag{18} $$

where $V_{j+1}$ is the value function associated to the coarse scale $j + 1$ (see Section 3.3.2). The
value function $V_j$ and its associated policy can then be propagated up or down the hierarchy
using the ideas discussed in Section 3.3. For instance, MMDP(2) could be re-compressed
from the scale at which $c_2$ resides (scale $j$) upwards using an updated policy derived from
$V_j$ (possibly blended with a previous policy). The value function in $c_2$ can also be used to
solve downwards below the current cluster by applying Algorithm 2 to the previous scale
with $V_j$ serving as the initial coarse data.
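
A boundary value problem of this kind can be solved numerically in several ways. The sketch below uses a simple Gauss-Seidel fixed-point iteration rather than a direct linear solve, which is a choice made here for brevity and not the procedure prescribed in the text. It assumes that transitions from interior states stay inside the cluster (interior plus boundary), and the names `solve_cluster_values`, `R`, and `Gamma` (rewards and discounts keyed by `(s, a, s')`) are hypothetical.

```python
def solve_cluster_values(interior, boundary_values, policy, P, R, Gamma,
                         tol=1e-10, max_iter=10000):
    """Iteratively evaluate a policy on cluster interior states.

    boundary_values: dict boundary state -> fixed coarse value.
    policy: dict state -> {action: probability}.
    P: nested dict P[s][a] = {next_state: probability}.
    R, Gamma: dicts keyed by (s, a, next_state).
    """
    V = dict(boundary_values)          # boundary values stay fixed
    V.update({s: 0.0 for s in interior})
    for _ in range(max_iter):
        delta = 0.0
        for s in interior:
            new = 0.0
            for a, pa in policy[s].items():
                for s2, p in P[s][a].items():
                    new += pa * p * (R[(s, a, s2)] + Gamma[(s, a, s2)] * V[s2])
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta < tol:
            break
    return V

# One interior state "s" and one boundary (bottleneck) state "b" with value 10.
V = solve_cluster_values(
    interior=["s"], boundary_values={"b": 10.0},
    policy={"s": {"a": 1.0}},
    P={"s": {"a": {"s": 0.5, "b": 0.5}}},
    R={("s", "a", "s"): -1.0, ("s", "a", "b"): -1.0},
    Gamma={("s", "a", "s"): 0.9, ("s", "a", "b"): 0.9},
)
```

For this toy cluster the interior value satisfies $V = 3.5 + 0.45\,V$, so the iteration converges to $3.5 / 0.55$.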



4.4 Transfer of Potential Operators 

Suppose $S_i^j$ denotes the full statespace for problem $i$ at scale $j$. At any (coarse) scale $j > 0$
above the finest scale one can consider transferring the potential operator

$$ \big(I - (\Gamma_1 \circ P_1)^{\pi^*}\big)^{-1} $$

associated to the optimal policy $\pi^*$ at scale $j$ of problem MMDP(1). Here we let $P_1, \Gamma_1$
generically denote the Markov transition tensor and compressed discount factors (respectively)
at the relevant scale of problem MMDP(1). We will consider for simplicity of the
exposition the transfer of entire scales, and require for the moment that the statespace
correspondence satisfy $\eta(S_2^j) = S_1^j$. In general, potential operators specific to clusters may
be readily transferred by extending the development here to include the ideas discussed in
Section 3.3.2 (see Equation (13) in particular).

Sub-problems at scales below $j$ of MMDP(2) decouple from each other given values at the
states in $S_2^j$. A value function $V_2^j$ on $S_2^j$ may be computed from the transferred potential
operator and MMDP(2) rewards as follows. Let $\pi_{c_2}$ denote the policy on $S_2^j$ transferred from
$\pi^*$ according to Section 4.3. Next, consider the MMDP(2) rewards $R_2$, aggregated over one
step with respect to $\pi_{c_2}$ and mapped back to MMDP(1):

$$ R_{2,1}(s) := \mathbb{E}_{a \sim \pi_{c_2},\, s'} \big[ R_2(\eta^{-1}(s), a, \eta^{-1}(s')) \big], \qquad s \in S_1^j. \tag{19} $$

The expectation on the right-hand side defines a system of rewards on $S_1^j$ by collecting the
(one-step) rewards in MMDP(2) following the policy on $S_2^j$ determined by mapping $\pi^*$ to
MMDP(2). Since the policy $\pi^*$ is deterministic, we do not take an expectation with respect
to MMDP(1) actions anywhere in (19). The value function $V_2^j$ is computed by applying the
potential operator $\mathcal{G}$ to these rewards and mapping back to MMDP(2) along $\eta$,

$$ V_2^j(s) = (\mathcal{G} R_{2,1})(\eta(s)), \qquad s \in S_2^j. \tag{20} $$

We draw attention to the fact that if the reward system for MMDP(2) depends on actions,
then as shown in Equation (19) computing a value function requires a set of rewards aggregated
with respect to a transferred policy $\pi_{c_2}$. In general, transferring a potential operator
therefore also entails mapping actions across problems, and transferring a policy. If the
MMDP(2) rewards in $c_2$ do not depend on actions, then (19) reduces to a simpler
expression not involving $\pi_{c_2}$; however, this situation is unlikely at any coarse scale.
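
Once the aggregated rewards of Equation (19) are available, applying a transferred potential operator amounts to a matrix-vector product followed by a relabeling of states along the correspondence. The sketch below assumes the operator is supplied as a dense list-of-lists `G` indexed by source states, that the aggregated rewards `r21` have already been computed, and that `eta` maps destination states to source states; all of these names are hypothetical.

```python
def apply_potential_operator(G, r21, s1_index, eta):
    """Compute destination values V2(w) = (G r21)(eta(w)).

    G: square matrix (list of rows) over source states, assumed precomputed.
    r21: aggregated destination rewards expressed on source states.
    s1_index: dict source state -> row index in G.
    eta: dict destination state -> source state.
    """
    # Matrix-vector product: quadratic in the number of coarse states.
    Gr = [sum(g * r for g, r in zip(row, r21)) for row in G]
    return {w: Gr[s1_index[s]] for w, s in eta.items()}

G = [[2.0, 1.0],
     [0.0, 1.0]]
values = apply_potential_operator(
    G, r21=[1.0, 3.0],
    s1_index={"s0": 0, "s1": 1},
    eta={"w0": "s1", "w1": "s0"},
)
```

Because no linear system is solved at transfer time, the cost is dominated by the single product with `G`.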

Using a potential operator from MMDP(1) to compute values for MMDP(2) sub-problems
provides three major advantages. Value determination is fast, $O(|S|^2)$ worst case, because
the operator is given. The resulting value function for MMDP(2) at scale $j$ also respects
the specific reward structure of MMDP(2). The third advantage is more subtle: the coarse
MDP initially associated to scale $j$ of MMDP(2) results from compression with respect
to a stochastic policy guess, and may not be compatible with the optimal policy at scale
$j$ of MMDP(1). If we simply transferred $\pi^*$ from MMDP(1), but used MMDP(2)'s coarse
transition dynamics to perform value determination, it is likely that any improvement arising
from knowledge of $\pi^*$ would be eliminated. Assuming the transfer is viable, MMDP(1)'s
potential operator will determine a value function that obeys the "correct" coarse-scale
Markov process and provides a warm start towards finding the optimal fine-scale policy.



4.5 Detecting Transferability 

Given a pair of matched clusters ($c_1 \in$ MMDP(1), $c_2 \in$ MMDP(2)) we would like to know
whether transfer of a policy or potential operator from $c_1 \in$ MMDP(1) to some or all of $c_2$
can be reasonably expected to help solve MMDP(2). As in Section 4.3, we will restrict our
attention to detecting opportunities for transfer only to the interior of $c_2$.

4.5.1 Policy Transferability



One way to approximately determine transferability of a policy is to check whether running
MMDP(1)'s optimal policy $\pi^*$ in $c_2 \in$ MMDP(2) results in more aggregate reward on average
than executing the current policy we have for MMDP(2). In most cases the "current" policy
will just be the diffusion policy $\pi^u$, so we will overload this notation to indicate either
of the current policy or the diffusion policy. If the reward collected following MMDP(1)'s
policy is lower than that collected using the current policy, this suggests that $\pi^*$ does not
provide a warm start, and transfer should not be attempted. In addition, if no statespace
correspondence is given then this test can be used to check whether assuming the default
identity correspondence can still support transfer from $c_1$ to $c_2$.

The value function in $c_2$ given the current policy $\pi^u$ for MMDP(2) is computed in the
usual way. Letting $V_j^u$ denote the desired value function and $V_{j+1}^u$ denote the current value
function at the next coarser scale $j + 1$, we must solve the system

$$ V_j^u(s) = \begin{cases} \mathbb{E}_{a \sim \pi^u(s)} \Big[ \sum_{s' \in c_2} P_2(s, a, s') \big( R_2(s, a, s') + \Gamma_2(s, a, s') V_j^u(s') \big) \Big] & \text{if } s \in \mathring{c}_2, \\ V_{j+1}^u(s) & \text{if } s \in \partial c_2. \end{cases} \tag{21} $$



The test for transferability compares $V_j^u$ to the value function $V_j$ describing rewards
collected in $c_2 \in$ MMDP(2) running $\pi^*$ from MMDP(1), computed according to Equation (18).
In other words, we transfer following Section 4.3, and then check whether we
see any improvement relative to the current policy in $c_2$. The result of the computations
in Equations (18) and (21) can be reused during the first iteration of Algorithm 2 (or its
variants) when solving the transfer problem. If similar underlying states of the environment
play different roles in the different tasks MMDP(1), MMDP(2), then $V_j$ could differ
significantly from $V_j^u$. The two functions should be compared on all of $c_2$ (not just $\mathcal{W}_\eta$
defined in Equation (17)). One can take a conservative approach and only pursue transfer
if $V_j(s) > V_j^u(s), \forall s \in c_2$. Or, if the situation is less clear, assessing the improvement may
involve other heuristics. For example, a relative comparison such as



$$ \sum_{s \in \mathring{c}_2} \frac{V_j(s) - V_j^u(s)}{|V_j^u(s)| + \mathbb{1}(V_j^u(s) = 0)} \overset{?}{>} 0. \tag{22} $$



This test checks whether the policy $\pi^*$ provides a "warm start" relative to $\pi^u$ given the
transition dynamics and reward structure for MMDP(2). If the inequality (22) above is
satisfied, then we can proceed with transferring the policy from cluster $c_1 \in$ MMDP(1)
to $c_2 \in$ MMDP(2). Note that since interior cluster values are computed in Equation (18)
with the cluster boundary fixed, the problem of assessing transferability from $c_1$ to $c_2$ is
independent of other clusters in MMDP(1) or MMDP(2).
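
Both the conservative test and a relative heuristic are straightforward to express once the two value functions are in hand. In the sketch below, the value functions are plain dictionaries over cluster states; the particular normalization in `relative_score` (dividing by the magnitude of the current value, or by one when that value is zero) is an assumption made for illustration, and the function names are hypothetical.

```python
def conservative_test(V_transfer, V_current, states):
    """Accept transfer only if the transferred policy is better at every state."""
    return all(V_transfer[s] > V_current[s] for s in states)

def relative_score(V_transfer, V_current, states):
    """Sum of relative improvements; a positive total suggests a warm start.

    The denominator guards against division by zero (assumed normalization).
    """
    total = 0.0
    for s in states:
        denom = abs(V_current[s]) if V_current[s] != 0 else 1.0
        total += (V_transfer[s] - V_current[s]) / denom
    return total

V_transfer = {"a": 2.0, "b": 1.0}
V_current = {"a": 1.0, "b": 2.0}
strict_ok = conservative_test(V_transfer, V_current, ["a", "b"])
score = relative_score(V_transfer, V_current, ["a", "b"])
```

In this toy case the conservative test rejects the transfer because state `b` gets worse, while the relative score is positive, which illustrates how the two criteria can disagree when the situation is less clear.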



4.5.2 Potential Operator Transferability 

The process for determining whether transferring a potential operator will be helpful or
not is similar to the procedure for policies. Transferring a potential operator is equivalent
to assuming that the dynamics of $c_1 \in$ MMDP(1) apply to $c_2 \in$ MMDP(2). Thus, as with
policies, it is worthwhile to consider comparing the expected discounted rewards collected
while following the transition dynamics governing MMDP(1) in MMDP(2) to the no-transfer
alternative. The expected reward starting from states in $\mathcal{W}_\eta$ under potential operator
transfer is given by Equation (20). If $\mathcal{W}_\eta$ is a proper subset of $c_2$, then a value function
everywhere on $c_2$ can be obtained by solving a small boundary value problem. In this case,
$\mathcal{W}_\eta$ is added to the boundary (in addition to $c_2$'s bottlenecks) and the values computed by
Equation (20) serve as the boundary values for states in $\mathcal{W}_\eta$:

$$ V_j(s) = \begin{cases} \mathbb{E}_{a \sim \pi^u(s)} \Big[ \sum_{s' \in c_2} P_2(s, a, s') \big( R_2(s, a, s') + \Gamma_2(s, a, s') V_j(s') \big) \Big] & \text{if } s \in \mathring{c}_2 \setminus \mathcal{W}_\eta, \\ V_{j+1}(s) & \text{if } s \in \partial c_2, \end{cases} $$



where $\pi^u$ is the initial policy (diffusion or otherwise), and $V_{j+1}$ is the initial coarse value
function. Values for the remaining states are computed according to MMDP(2)'s rewards
and transition probabilities, as this is the only possibility in the absence of a wider
correspondence. Analogous to the case of policy transfer, the computations in the equation above may
be reused when solving the transfer problem.

If we do not transfer the potential operator, we would otherwise just follow the current
(or uniform stochastic) policy in $c_2$ with the usual MMDP(2) dynamics. The expected
discounted reward when there is no transfer is determined by solving Equation (21) as
before. The final comparison between transfer/no-transfer can be performed on all of $c_2$,
and may involve a heuristic such as Equation (22). If the reward system in MMDP(2) is
strongly dependent on actions, then the quality of a potential operator transfer may also
depend on the quality of the mapped policy $\pi_{c_2}$ by way of Equation (19). In such situations
one can also assess the quality of $\pi_{c_2}$ by using the procedure above.



4.6 Statespace Graph Matching 

Establishing a correspondence between the discrete, finite statespaces of two problems can 
be an important prerequisite for some, if not most, types of transfer. Recall that a problem's 
statespace graph is a graph with states as its vertices and edges/weights defined by a 
transition probability kernel. Such a graph may, for instance, be characterized by a graph 
Laplacian of the type defined in Section 3.1.1. The goal of a statespace graph matching is
to establish a correspondence between the roles played by states in each problem. Consider
for example two related problems MMDP(1), MMDP(2), each with a single terminal (goal)
state. It would be desirable to be able to match the terminal states as "goals", even if the
terminal states are different in the sense that they have different representations in some 
underlying space (e.g. as features or coordinates in a Euclidean space). The same could be 
true for other states that play a pivotal role, such as "gateway" states directly connected to 
goal states. Graph matching ultimately seeks to abstract away problem-specific roles from 
the underlying state representations, and then match similar roles across problems. We will 
further illustrate this concept by way of several examples in Section 5.

Although statespace matching can be important for a transfer problem, it can also be
expensive computationally and imprecise in practice. For some problems defined on discrete
domains, key correspondences may need to be correct, otherwise the transferred information
may actually diminish performance. For these reasons we do not require graph matching
nor do we propose a full solution to the matching problem. We will restrict our attention
to transfer scenarios where:

(i) It is possible to use a default "identity" correspondence, or detect that the identity
is a poor choice and transfer should not be attempted (using the ideas in Section 4.5).

(ii) The graph matching is relatively simple, and there is a limited potential for catastrophic
errors. For example, matching at coarse scales between small collections of
states.

The algorithm we will use to match statespace graphs is heuristic. Given two sets of
states $c_1 \in$ MMDP(1), $c_2 \in$ MMDP(2),

1. Compute the pairwise diffusion distances $d(s_i, s_j)$, $s_i \in c_1, s_j \in c_2$, according to
Section 3.1.1. Note that this does not involve any particular underlying representation
associated to the statespaces.

2. Build the affinity matrix $W_{ij} = \exp(-d^2(s_i, s_j)/\sigma^2)$, for some appropriate choice of $\sigma$
(e.g. the median pairwise distance in the set).

3. Apply any graph matching algorithm based on affinities.
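
The three steps above can be sketched as follows, assuming the pairwise diffusion distances are already available as a matrix `D`. The Gaussian affinity with a squared distance and median bandwidth follows step 2 as read here; the greedy one-to-one pairing in `greedy_match` is a deliberately simple stand-in for "any graph matching algorithm" and is not one of the cited methods. All function names are hypothetical.

```python
import math
from statistics import median

def affinity_matrix(D):
    """Gaussian affinities from a distance matrix, with sigma set to the median.

    Assumes the median pairwise distance is strictly positive.
    """
    sigma = median(d for row in D for d in row)
    return [[math.exp(-(d * d) / (sigma * sigma)) for d in row] for row in D]

def greedy_match(W):
    """Pair rows with columns by repeatedly taking the largest unused affinity."""
    pairs = sorted(((W[i][j], i, j)
                    for i in range(len(W)) for j in range(len(W[0]))),
                   reverse=True)
    used_rows, used_cols, match = set(), set(), {}
    for _, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            match[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return match

D = [[0.1, 2.0],
     [2.0, 0.2]]
W = affinity_matrix(D)
match = greedy_match(W)
```

A greedy pairing can be suboptimal on larger or more ambiguous problems, which is one reason the matching is restricted in the text to small collections of states at coarse scales.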



Graph matching is itself an area of active research, and several algorithms exist (Fremuth-Paeger
and Jungnickel, 1999; Sanghavi et al., 2007; Huang and Jebara, 2007, 2011). For
a graph with $|V|$ vertices and $|E|$ edges, the min-cost flow algorithm of (Fremuth-Paeger
and Jungnickel, 1999) has $O(|V||E|)$ running time. Jebara and colleagues have improved
upon this with a belief-propagation algorithm giving a running time of $O(|V|^{2.5})$ on average
(Huang and Jebara, 2007, 2011) (but $O(|V||E|)$ in the worst case).



The diffusion map embeddings may be computed either locally within clusters if $c_1$
and $c_2$ are contained in clusters at a coarser scale, or approximately with a small number of
eigenvectors when $c_1$ and/or $c_2$ is the entire problem statespace. Specific problem knowledge
may guide in many cases the choice of $c_1$ and $c_2$. For example, we may need to match a
cluster $c_2 \in$ MMDP(2) to states in MMDP(1), but might reasonably expect that $c_2$ can only
correspond to a small number of states in MMDP(1) rather than all of $S_1$. The examples
discussed in Section 5 below illustrate graph matching and the transfer procedures suggested
above in more detail.



5. Experiments 



We will illustrate compression and transfer learning in the case of three examples: a discrete
50 x 50 gridworld domain with multiscale structure, a 3-dimensional continuous two-task
inverted pendulum problem, and a pair of problems based on the "playroom" domain of Singh
et al. (2005); Barto et al. (2004). The gridworld tasks require an agent to navigate to a
goal location in a 2D environment. The inverted pendulum problem involves first moving
a cart to a desired position, and then moving the cart while balancing the pendulum to
another position. Finally, the playroom domain examples involve learning to carry out
sequences of specific interactions with various objects and actuators in a desired order. The
setup of compression, transfer and transfer detection is the focus of this section, rather
than an exhaustive performance comparison with other algorithms. For this reason, most
of the performance plots below show error versus the number of algorithm iterations, even
if different algorithms have dramatically different computational complexity per iteration
(in particular, the proposed multiscale algorithms have a cost per iteration much smaller
than global algorithms such as policy iteration).

We will consider several multiscale algorithms, obtained by choosing different paths
along the flow diagram in Figure 2, and different numbers of interior or boundary update
iterations. Each variant has the basic structure of Algorithm 2; however, the particular
updates applied may differ. The following table summarizes the multiscale algorithms we
will consider:



Algorithm Name    Interior Update    Boundary Update
oo                once               once
oc                once               Alg. 2
or                once               recompress
co                to convergence     once
cc                to convergence     Alg. 2
cr                to convergence     recompress

Table 1: Multiscale algorithms tested in the experiments.



The designations "once" and "to convergence" refer to the number of updates applied to
the interior/boundary states, before updating the boundary/interior. There are two cluster
interior update possibilities: either we perform policy iteration in each cluster until convergence
("to convergence", i.e. when the relative error between iterates falls below 0.01), or we
apply only one policy iteration per cluster ("once"). To make comparisons fair, for algorithms
iterating within clusters to convergence, each pass applying one local policy iteration
update to all the clusters is counted as a single outer "algorithm iteration" in the plots (see footnote 13).
The bottlenecks (boundary) are updated either as in Algorithm 2, by way of repeated local
averaging steps ("Alg. 2"), or by recompressing the fine scale MDP and then solving
the resulting coarse MDP ("recompress"). For accounting purposes, a boundary update,
regardless of type, is considered part of the same outer algorithm iteration as the immediately
preceding interior update (or pass over cluster interiors). In all experiments, the cost
of initial hierarchy construction and transfer detection/policy-mapping (when applied) is
not included, as they are only done once for a problem.

Which of the algorithms is best suited to a given problem strongly depends on whether
the initial data can be trusted. There are three kinds of initial data in question: the
initial fine scale policy, the initial coarse value function, and the policy or policies used to
initially compress the fine scale MDP. Empirically, we have observed the latter two types
to be the most significant. If the coarse value function is trustworthy, then iterating within
cluster interiors to convergence before updating the bottlenecks is generally optimal. Initial
boundary information is allowed to propagate throughout the fine scale interior, and the
boundary values are modified only after the interior cannot be improved further. This
situation might arise when pursuing potential operator transfer, or if the boundary value
function solves a coarse MDP compressed according to a pool of policies as in Section 3.3.1.

13. This is overly conservative because, in general, convergence rates will be different across clusters. We
have assumed that every cluster has the worst convergence rate.

Figure 4: (Left) Original grid world. (Right) Transfer to a world with similar multiscale structure
and (relative) goals, but a completely different optimal policy.

By contrast, if the initial coarse data is suspect, then we may choose to iterate the interior
once or a small number of times, and then improve the boundary immediately afterwards.
This may be the case if, for example, the initial coarse value function solves a coarse MDP
compressed with respect to the diffusion policy. Applying many interior iterations can
otherwise propagate erroneous information, and slow the solution process considerably. In
short, when the initial coarse information is trustworthy, it should be leveraged as far as
possible. Otherwise, if it is suspect, the coarse initial data should be imposed lightly.

For those experiments where a pool of cluster policies was used to initially compress
the fine scale (following Section 3.3.1) and there is recompression, we will effectively add
the current fine policy to the existing pool of initial guesses, and use the augmented pool
to recompress. This allows the solution at the coarse scale to ignore actions invoking
the current fine policy if the actions corresponding to the initial guess policies are better.
Under these conditions, the coarse value function can only increase, since we are providing
additional actions beyond those resulting from the initial compression. Since each fine scale
cluster policy corresponds to a coarse action, recompression is efficient in practice, and
involves compressing only with respect to the new fine policy. One can concatenate new
coarse probabilities, discounts, and rewards with those resulting from the initial compression
and then proceed to solve at the coarse scale. In experiments involving initial compression
with respect to the diffusion policy only, we recompressed using the current fine policy,
and discarded coarse actions corresponding to the diffusion policy. In all experiments,
compression involved blending each cluster policy with a small amount ($\lambda = 0.01$) of the
diffusion policy in order to preserve the boundary reachability assumption.
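
The blending step can be illustrated with a short sketch. It assumes the blend is a convex combination of the cluster policy with the uniform (diffusion) policy, weighted by $\lambda$, which is one natural reading of the sentence above rather than a form confirmed by the text; the name `blend_with_diffusion` is hypothetical.

```python
def blend_with_diffusion(policy, actions, lam=0.01):
    """Mix each state's action distribution with the uniform distribution.

    policy: dict state -> {action: probability} (missing actions count as 0).
    lam: weight placed on the uniform policy (assumed convex combination).
    """
    uniform = 1.0 / len(actions)
    return {
        s: {a: (1.0 - lam) * policy[s].get(a, 0.0) + lam * uniform for a in actions}
        for s in policy
    }

actions = ["up", "down", "left", "right"]
blended = blend_with_diffusion({"s": {"up": 1.0}}, actions)
```

After blending, every action has strictly positive probability at every state, so a deterministic cluster policy can no longer make boundary states unreachable.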

5.1 Gridworld Domain 

In the gridworld domain, an agent must navigate within a two-dimensional world from an
arbitrary starting point to a designated goal state. Two 50 x 50 gridworlds we will consider
are shown in Figure 4, where grey blocks represent immovable obstacles (walls) and large
grey circles denote terminal goal states. The actions available to the agent are up, down,
left, right. The four movement actions are reversible, and succeed with probability 0.9.
Actions that would otherwise allow the agent to step on or through an obstacle fail with
probability 1, and the agent remains in place. For all worlds, the reward function is set to
-1 for all states except the goal state, which is assigned a reward of +10. We assume that
these transition probabilities $P$ and rewards $R$ are given.

Figure 5: Detected gridworld bottlenecks ('x') at scales ranging from fine to coarse (left to right).
At each scale, non-zero entries of the corresponding compressed MDP's transition matrix
are plotted as links between bottleneck states, and depict the directed statespace graphs.
The terminal goal state is marked by a circle.
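
The gridworld dynamics described above can be written down compactly. The sketch below makes two assumptions that the text does not state: when a move fails (probability 0.1) the agent stays in place, and moves off the edge of the grid are treated like moves into a wall. The names `MOVES`, `transition`, and `reward` are hypothetical.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def transition(state, action, walls, size, p_success=0.9):
    """Return {next_state: probability} for one action in the gridworld.

    Assumes a failed move leaves the agent in place, and that grid edges
    behave like walls (both are illustrative assumptions).
    """
    row, col = state
    d_row, d_col = MOVES[action]
    target = (row + d_row, col + d_col)
    in_bounds = 0 <= target[0] < size and 0 <= target[1] < size
    if target in walls or not in_bounds:
        return {state: 1.0}            # blocked moves fail with probability 1
    return {target: p_success, state: 1.0 - p_success}

def reward(state, goal):
    """+10 at the goal state, -1 everywhere else."""
    return 10.0 if state == goal else -1.0

blocked = transition((1, 1), "up", walls={(0, 1)}, size=3)
free = transition((1, 1), "right", walls=set(), size=3)
```

A full transition tensor for a 50 x 50 world could be assembled by calling `transition` for every free cell and action.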

Bottleneck detection and partitioning was done once, before compressing, to a maximum
of 3 scales. We then compressed the gridworld problem 3 times using the initial local
guess policies described in Section 3.3.1. Figure 5 shows, clockwise from top left, the
detected bottlenecks (marked by 'x' characters) and statespace adjacency graphs (from the
compressed transition probability matrices) superimposed on the original world for each
successively coarsened MDP. The graphs are directed; however, for readability we do not
show directionality in the plots. One can see that as the problem is repeatedly compressed,
clusters become successively lumped together into coarser approximations to the original
world. A solution to the MDP at the first compressed level, for example, determines the
optimal sequence of clusters the agent should traverse to reach the goal. Figure 6 shows
policies resulting from solving the coarse MDPs, depicted as directed arrows marking a path
along bottleneck states to the goal. For this problem, solutions to the coarse problems are
compatible with the optimal fine scale policy.

In Figure 4 (right), we show a gridworld to which knowledge may be transferred given
a solution at some scale to the problem on the left. In the transfer world (right) the optimal
policy and state-transition behavior are significantly different from those of the original
world (left). The reward function is also different, since the goal has moved; however, the
multiscale structure is similar, and the goal is in the same cluster as before (though the
cluster has moved). The optimal fine scale policies within clusters of similar geometry are
also similar across problems. Indeed, for this world some or all of the solution at any of
scales 1-4 may be transferred following the process discussed in Section 4. We will consider
a simple transfer scenario in which the optimal fine scale policy is transferred wherever
transferability detection (Section 4.5) indicates it is advantageous to do so. Details
discussing the application of transfer Algorithm 4 are given below.

Figure 6: Greedy coarse policies solving MDPs at successively coarser scales (left to right), visualized
as arrows representing transitions between bottlenecks. The MDPs were compressed with
respect to the local guesses (for each scale) described in Section 3.3.1.

Cluster Correspondence: For this problem, the cluster correspondence algorithm described
in Section 4.2 correctly pairs together the clusters in the source and destination problems.
Figure 7 shows the partitioning of each world into clusters identified by the recursive spectral
clustering step (Algorithm 1), as well as the correspondences returned by the cluster
correspondence algorithm. Within each world, clusters are demarcated by shade of color,
and across worlds, gray lines connect the centroids of clusters that have been paired together.



Transfer Detection: The transfer detection algorithm described in Section 4.5 was applied
to each pair of matched clusters. It is clear that with the exception of clusters 1 and 8
in Figure 7, the clusters are similarly oriented in both worlds. Thus for this problem, one
should be able to skip statespace matching within clusters and rely on transfer detection to
confirm whether this was ultimately a safe thing to do. Omitting a statespace matching at
the fine scale is equivalent to assuming that paired subproblems have the same orientation
with respect to the problem domain and bottlenecks. In general, of course, it is hard to
know a priori whether identified sub-problems share the same orientation as a pre-solved
problem stored in a database of solutions, and statespace matching at all scales involved
in the transfer should be performed. Nevertheless, one can still attempt to assume the
orientations are correct, and then detect whether this assumption is valid or not. This
approach may be particularly fruitful whenever the fine scale statespaces are large and
complex, so that graph matching is difficult and error-prone. For the present gridworld
problem, the detection algorithm identifies clusters 2-7 as policy transfer candidates, and
rejects clusters 1 and 8. This result coincides with our earlier visual intuition from Figure 7.

Fine Scale Policy Transfer: Within clusters 2-7, the fine scale optimal policy for the source
problem was mapped to the destination problem following the mapping procedure described
in Section 4.3. States in the destination world which did not receive a policy by transfer were
given a deterministic policy that always recommends the up action, rather than a uniform
distribution over all actions. Mapping the actions across worlds is easy in this case, since
the action spaces are identical, and pieces of the optimal policy for the original problem
largely transfer without modification. Comparing the two worlds in Figure 7, however, the
detected bottleneck states are not always in the same place relative to a given cluster. For
example, the two bottlenecks at the top-left corner of cluster 5 are one grid space to the
right in the transfer world as compared to the original world. The policy transfer algorithm
in Section 4.3 maps a policy between cluster interiors along the established correspondence,
which in our case is simply an enumeration of the cluster's states in column-scan order,
from top-left to bottom-right. For cluster 5, the difference in relative positioning of the
bottlenecks creates a misalignment between the source and destination clusters' interior
states. For this particular problem, however, this misalignment imposes little error since
the optimal policy is constant over large portions of the cluster. This is likely the reason
why cluster 5 was identified as a good candidate for transfer, despite alignment errors at
the fine scale. In general, policy transfer may be relatively robust to correspondence errors
at the fine scale since, by construction, the underlying Markov chain is fast mixing within
clusters.



Transfer Problem Solution: Several multiscale algorithms listed in Table 1 were evaluated,
with the transferred (fine scale) policy serving as the initial policy for local policy iteration
in the destination problem. In all cases, the initial coarse scale value function was obtained
by solving the coarse MDP given by compression with respect to the pool of local policy
guesses discussed in Section 3.3.1, and in all experiments, the blending parameter appearing
in policy updates was set to $\lambda = 1$, thereby imposing a greedy policy updating convention.

Figure 8: Solution of the gridworld transfer problem described in Section 5.1: comparison among
various multiscale algorithms, both with and without transfer, and comparison to policy
iteration. See text for details.

Plots in Figure 8 with y-axis labeled "Error" show the Euclidean distance between
the value function after t iterations (x-axis) of the given algorithm and the optimal value
function for this problem (see footnote 14). Inset plots detail boxed regions. See Table 5 and surrounding
discussion for a description of the algorithms and their labels. The default, fine-scale initial
policy for experiments without transfer is an arbitrary deterministic policy that always
chooses the up action.
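To make the error measure concrete, the short sketch below computes the Euclidean ($L_2$) error plotted in the figures next to the max-norm ($L_\infty$) error. The function names and the list-of-floats representation of a value function are assumptions made for illustration.

```python
import math


def l2_error(v, v_opt):
    """Euclidean distance between a value function and the optimal one."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, v_opt)))


def linf_error(v, v_opt):
    """Largest absolute error over all states."""
    return max(abs(a - b) for a, b in zip(v, v_opt))


v_opt = [1.0, 1.0, 1.0, 1.0]
one_state_wrong = [0.0, 1.0, 1.0, 1.0]
all_states_wrong = [0.0, 0.0, 0.0, 0.0]
```

Both estimates have the same max-norm error, but the Euclidean error of `all_states_wrong` is twice that of `one_state_wrong`, so only the Euclidean error registers progress made across the whole statespace.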

Figures 8(a) and 8(b) show the performance of various multiscale algorithms with and
without fine scale policy transfer, respectively. Comparing the scale of the vertical axes
across the two plots shows that transfer provides a good warm start for all algorithms. Because the
initial coarse value function solves an MDP compressed with respect to a pool of initial policy
guesses, we may expect that the initial coarse value function is trustworthy. For gridworld-type
problems in particular, this is a reasonable expectation. Comparing curves within the
figures, it is clear that the coarse initial data is indeed good: algorithms which leverage the
initial condition as far as possible, and iterate inside clusters to convergence before updating
the boundary ("MS-{cc, co, cr}" traces), perform better than the algorithms that only
iterate over interiors once between boundary updates ("MS-{oo, oc}" traces). Algorithms
updating the bottlenecks by recompression ("MS-cr" traces) are seen to converge to a
suboptimal value function; however, the corresponding policy is optimal after iteration 29
in Figure 8(a) and after iteration 33 in Figure 8(b). MS-cr is the single best algorithm
for solving the gridworld problem, both with and without transfer, and MS-co is the best
algorithm not involving recompression.

Because we have only considered fine-scale policy transfer, we can compare to the performance
of the canonical policy iteration algorithm given the transferred policy as the initial
condition (see footnote 15). Figures 8(d) and 8(e) respectively compare policy iteration with and without
transfer to multiscale algorithms MS-co and MS-oo, with and without transfer. The multiscale
algorithm curves in these figures are the same as the corresponding curves in Figures 8(a)
and 8(b). Comparing curves within plots, it is clear that the transferred policy also provides
policy iteration with a helpful warm start. However, comparing the difference in
starting error between the multiscale transfer/no-transfer curves with that of policy iteration,
the multiscale algorithms are better able to take advantage of the transferred information.
The improvement of the multiscale no-transfer curve over the policy iteration no-transfer
curve reflects the improvement due to both coarse information as well as the multiscale
approach. The two multiscale algorithms converge to optimality in about the same number
of iterations; however, for this domain MS-co exhibits stronger non-monotonicity.

As discussed in Section 3.3, there are two primary reasons why the multiscale algorithm
can perform better than policy iteration: the multiscale algorithm starts with coarse knowledge
of the fine solution given by the solution to the compressed MDP, and the multiscale
approach can offer faster convergence since convergence of local (cluster) policy iteration is
constrained by faster mixing times within clusters rather than slow times across clusters.
These are reasons why Algorithm 2 can converge in fewer iterations. But an iteration-count
comparison to vanilla policy iteration is not entirely fair because each iteration of the multiscale
algorithm is significantly cheaper, as described in Section 3.4. Figure 8(c) shows
elapsed wall time after t iterations for policy iteration and MS-oo algorithms. Experiments
were conducted on a dual Intel Xeon E5320 machine running 64-bit MATLAB release 2010b
under Linux, with no parallelization or optimization beyond the default BLAS/ATLAS multicore
routines embedded into native MATLAB linear algebra calls. It is evident that the
multiscale algorithm compares favorably to policy iteration, both in terms of total iterations
and per-iteration scaling in time. We note that the ratio of the slopes in Figure 8(c) is specific
to this example. In general this ratio is determined by the cardinality of the statespace
and the number of identified clusters (statespace graph cuts), and will vary across problems.

14. In this experiment and those that follow, we will use the $L_2$ norm to measure error rather than $L_\infty$, as
it is a more revealing indicator of progress over the entire statespace in question.

15. Note that in general if there is transfer at coarse scales, then such a comparison is not possible since the
policy iteration algorithm cannot directly take advantage of coarse information.

5.2 Continuous Two-Task Cart-Pendulum Domain

The cart-pendulum problem is a classic continuous control task where an inverted pendulum 
attached to a cart on a track must be balanced by applying force to the cart. We consider 
a slightly more complex domain in which there are two additions: the cart must be moved 
to a particular goal location, and at some portions of the track the pendulum is held fixed 
and does not need to be balanced, while in other regions the pendulum is free and must be 
balanced. In all simulations, the length of the pendulum is 0.5m, its mass is 1kg, and the 
mass of the cart is 5kg. There are three actions, corresponding to applying horizontal forces
of −20N, 0N, or +20N to the cart. These control inputs are subsequently corrupted at each
time step by i.i.d. additive, zero-mean Gaussian noise with standard deviation $\sigma = 5$. Three
state variables are used, $\{\theta, \dot{\theta}, x\}$, where $\theta$ is the angle of the pendulum from the vertical,
$\dot{\theta} = d\theta/dt$, and $x$ is the horizontal position along a track spanning the interval [−30m, 30m]. If
the pendulum falls over ($|\theta| = \pi/2$) or the end of the track is reached ($|x| = 30$), a reward of
−1 is received, and the simulation ends. Unless otherwise noted below, at any other state
a reward of 0 is received. Within this domain we will consider two different tasks:

Default (MMDP(1)): The goal of the default task is to move the cart to position x = +20
along the track, whereupon a reward of +100 is received, and the simulation ends. If
the cart is at any position x > 0, the pendulum is held fixed in the upright position
($\theta = 0$, $\dot{\theta} = 0$), but the cart is free to move. Otherwise, if x ≤ 0, the pendulum is
able to move freely as usual and must be balanced. If a simulation is started at some
initial position $x_0 < 0$, then two sub-tasks must be solved in order: (1) The pendulum
must be balanced while moving right until reaching x = 0 ("balance"), and (2) the
pendulum is held upright but must be carried while moving right towards x = 20
("carry").

Transfer (MMDP(2)): The goal of the transfer task is the same: the cart must be moved to
the position x = +20 along the track, whereupon a reward of +100 is received, and
the simulation ends. The regions where the pendulum must be carried vs. balanced
are swapped, however. If the cart is at any position x < 0, the pendulum is held
fixed at ($\theta = 0$, $\dot{\theta} = 0$). Otherwise, if x ≥ 0, the pendulum is able to move freely
and must be balanced. Thus, for simulations starting at some initial position $x_0 < 0$,
the two sub-tasks that must be solved occur in the opposite order (carry, balance): (1)
Carry the pendulum while moving right until reaching x = 0, and then (2) balance
the pendulum while moving right.
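The rules shared by the two tasks can be summarized in a few lines of code. The sketch below is an illustration under stated assumptions rather than the simulator used in the experiments: the names are hypothetical, thresholds are tested with inequalities (a discrete-time simulation will rarely land on x = 20 exactly), and the failure check is given precedence over the goal check.

```python
import math
from typing import NamedTuple

TRACK_LIMIT = 30.0          # reaching |x| = 30 ends the episode with reward -1
GOAL_X = 20.0               # reaching x = +20 ends the episode with reward +100
FALL_ANGLE = math.pi / 2    # the pendulum has fallen once |theta| reaches pi/2


class CartState(NamedTuple):
    theta: float
    theta_dot: float
    x: float


def pendulum_is_held(task, x):
    """True where the pendulum is clamped upright (the "carry" region)."""
    if task == "default":
        return x > 0
    if task == "transfer":
        return x < 0
    raise ValueError(f"unknown task: {task}")


def apply_hold(task, s):
    """Clamp theta and theta_dot to zero inside the carry region."""
    return CartState(0.0, 0.0, s.x) if pendulum_is_held(task, s.x) else s


def reward_and_done(s):
    """Reward and termination flag; identical for both tasks."""
    if abs(s.theta) >= FALL_ANGLE or abs(s.x) >= TRACK_LIMIT:
        return -1.0, True
    if s.x >= GOAL_X:
        return 100.0, True
    return 0.0, False
```

Only `pendulum_is_held` differs between the two tasks, which is what makes the carry and balance skills natural candidates for transfer.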

The goal of transfer for this domain is to convey the ability to carry or balance the 
pendulum. The agent must still learn when to apply these skills, and in which order. Although
this particular pair of problems involves only two sub-tasks, it serves to illustrate
how transfer within the multiscale framework we have described can be achieved in the 
context of a continuous control problem. Furthermore, in contrast to the other example 
domains, the domain described here has a multiscale structure induced by changes in the 
intrinsic dimension of the statespace; diffusion geometry is less helpful here. In light of 
these differences, we will consider alternative choices for the statespace partitioning and 
graph matching below. 

Simulation and Statespace Discretization: For both problems above, fine scale MDPs were
estimated from Monte-Carlo simulations. For each problem, we simulated the systems with
the diffusion (uniform random) policy for 8000 episodes. The initial state (see footnote 16) for each episode
was drawn according to $\theta, \dot{\theta} \sim U[-0.2, 0.2]$, $x \sim U[-20, 20]$, $\dot{x} \sim U[-0.1, 0.1]$, where $U[a, b]$
denotes the uniform distribution on the interval $[a, b]$. The simulation stepsize was set to
$\Delta t = 0.1$s, giving a 10Hz control input.
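The sampling setup can be written down compactly. The following is a minimal sketch, assuming the uniform ranges, step size, action set and noise level quoted in this section; the names are hypothetical and the cart-pendulum dynamics themselves are deliberately left out.

```python
import random

DT = 0.1                       # simulation step in seconds, i.e. a 10 Hz control input
FORCES = (-20.0, 0.0, 20.0)    # the three available horizontal forces, in newtons
NOISE_STD = 5.0                # standard deviation of the additive control noise


def sample_initial_state(rng):
    """Draw an initial state from the uniform ranges used for each episode."""
    return {
        "theta": rng.uniform(-0.2, 0.2),
        "theta_dot": rng.uniform(-0.2, 0.2),
        "x": rng.uniform(-20.0, 20.0),
        "x_dot": rng.uniform(-0.1, 0.1),
    }


def diffusion_policy(rng):
    """Uniform random choice among the available actions."""
    return rng.randrange(len(FORCES))


def noisy_force(action, rng):
    """Commanded force corrupted by zero-mean Gaussian noise."""
    return FORCES[action] + rng.gauss(0.0, NOISE_STD)
```

Passing an explicit `random.Random` instance keeps an 8000-episode run reproducible from a single seed.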

The resulting samples were normalized to have equal range in each coordinate, and clustered
according to a $\delta$-net with $\delta$ chosen so that we obtained approximately 500 representative
states from the pool of samples. These 500 states determined the discrete statespace
on which the given MDP was defined. Absorbing states reached during the simulation were
separately clustered and the resulting cluster representatives added to the previous 500
states. This ensured that the terminal boundaries of the problem were clearly represented
in the discretized statespace. The above procedure was applied separately to each problem,
and the final sizes of the statespaces were 502 and 515, of which 56 and 75 were terminal,
for the default and transfer problems respectively. Figures 9(a) and 9(b) show plots of
these discretized statespaces. Within each figure, the left-hand plot shows states as circles
in $(x, \theta, \dot{\theta})$ coordinates, with large dark circles indicating terminal states. Right-hand plots
graph states in $(x, \theta)$ coordinates.
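One simple way to obtain such a set of representatives is a greedy pass over the normalized samples. The text does not say which net construction was used, so the sketch below is an assumption, and the function names are hypothetical.

```python
import math


def normalize_range(samples):
    """Rescale every coordinate to the range [0, 1]."""
    cols = list(zip(*samples))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]   # guard constant coordinates
    return [tuple((v - l) / w for v, l, w in zip(s, lo, span)) for s in samples]


def greedy_delta_net(points, delta):
    """Return indices of representatives forming a delta-net of the points.

    A point becomes a representative only if it is farther than delta from every
    representative chosen so far. The result is delta-separated, and every point
    lies within delta of some representative.
    """
    centers = []
    for i, p in enumerate(points):
        if all(math.dist(p, points[j]) > delta for j in centers):
            centers.append(i)
    return centers
```

The number of representatives shrinks as `delta` grows, so a target size such as roughly 500 states can be reached by adjusting `delta`, for example by bisection.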

MMDP Construction: Given the simulation samples and clusters defined above, transition
and reward statistics between the $\delta$ regions claimed by the representative states were computed
to estimate $P(s, a, s')$ and $R(s, a, s')$ for the respective fine scale MDPs. Absorbing
boundaries were enforced by forcing MDP states which had absorbing samples as their
closest neighbor (out of all samples) to be absorbing. If any states were subsequently rendered
unreachable, they were removed from the problem. Rewards for terminal states were
similarly enforced by imposing the reward received at the neighbors nearest to the designated
absorbing MDP states. Collectively these steps ensured that the absorbing rewards
and states - the boundary conditions - were sufficiently captured in the translation from a
continuous to a discrete problem.
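The estimation step amounts to assigning each sampled transition to its nearest representative states and forming empirical frequencies and mean rewards. The sketch below illustrates this under simplifying assumptions: the function names and the `(state, action, next_state, reward)` sample layout are hypothetical, and the enforcement of absorbing boundaries and the removal of unreachable states are omitted.

```python
import math
from collections import defaultdict


def nearest(reps, point):
    """Index of the representative state closest to a continuous point."""
    return min(range(len(reps)), key=lambda i: math.dist(reps[i], point))


def estimate_mdp(samples, reps):
    """Empirical P(s, a, s') and R(s, a, s') from simulated transitions.

    samples: iterable of (state, action, next_state, reward) with continuous
    states. The returned dictionaries are keyed by (s, a, s') with s and s'
    given as representative indices.
    """
    counts = defaultdict(int)
    totals = defaultdict(int)
    reward_sums = defaultdict(float)
    for state, action, next_state, reward in samples:
        i, j = nearest(reps, state), nearest(reps, next_state)
        counts[(i, action, j)] += 1
        totals[(i, action)] += 1
        reward_sums[(i, action, j)] += reward
    P = {key: n / totals[key[:2]] for key, n in counts.items()}
    R = {key: reward_sums[key] / n for key, n in counts.items()}
    return P, R
```

State-action pairs that were never sampled are simply absent from both dictionaries, so a caller has to decide how to treat them.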

To define a coarse scale MDP, we next partitioned the statespace into clusters and
identified bottlenecks. For the problems described above, however, geometry is not as
helpful, and we chose to pursue an approach different from the spectral clustering algorithm
described in Section 3.1. Here, there is a natural partitioning of the statespace based on
intrinsic dimension; balancing is a 3D task, while carrying (without balancing) is a 1D task.
Thus, we detected where dimension changes in the statespace take place, and partitioned
accordingly. The right-hand plots in Figures 9(a) and 9(b) respectively show the result
of this partitioning for the default and transfer problems. States are colored according to
membership in one of two resulting clusters. One can see that the clusters clearly correspond
to sub-tasks carry and balance. Terminal states are marked by large gray circles, and
non-absorbing bottlenecks are indicated with (magenta) ×'s. As would be expected, in both
problems the state just at the interface of the two clusters is a bottleneck. Some additional
bottlenecks also result from the partitioning.

Figure 9: Discretized statespaces for the two different pendulum control problems described in
Section 5.2. A fine scale policy is transferred from the default problem having the statespace
shown in (a) to the transfer problem with statespace in (b). Large circles indicate terminal
states, while × indicates non-absorbing bottleneck states. See text for details.

16. Our simulator uses the additional state variable $\dot{x} = dx/dt$ internally; however, this variable is ignored
externally (that is, during clustering, policy evaluation, etc.).

On the basis of the clusters and bottlenecks shown in Figures 9(a) and 9(b), the transfer
problem was compressed once with respect to the diffusion policy (see footnote 17), assuming a fine scale
uniform discount rate $\gamma = 0.99$.

Policy Transfer: We next established cluster and fine scale statespace graph correspondences
in order to transfer the fine scale policy from MMDP(1) to MMDP(2). The action sets across
problems are already in correspondence, so action mapping was not pursued (see footnote 18). Since the
statespaces partition into clusters on the basis of dimension, we simply matched clusters
having the same dimension across problems. In this way, the carry and balance subtasks
were easily placed into correspondence (respectively).

To construct a fine scale graph matching, we assumed that the statespace coordinate
systems were already aligned across problems (that is, we assumed we knew which coordinate
in MMDP(1) corresponds to the coordinate for $x$ in MMDP(2); similarly for $\theta$
and $\dot{\theta}$). Letting $S_i$ denote the statespace of MMDP(i), we then considered a state mapping
$\phi : S_2 \to S_1$ of the form

$$\phi : (x, \theta, \dot{\theta}) \mapsto (f(x), g(\theta), h(\dot{\theta})).$$

The cluster correspondence previously established induces a natural mapping between $x$
coordinates; for instance, we simply mapped the $x$ interval $(-30, 0]$ in MMDP(2) onto the
interval $[0, 20)$ for MMDP(1) to define $f$ on states within the carry cluster of MMDP(1).
A similar mapping was constructed to define $f$ on states within the balance cluster. The
coordinate maps $g, h$ were taken to be the identity, since $\theta, \dot{\theta}$ are directly comparable across
problems. A statespace correspondence $\eta$ was then defined based on the nearest-neighbor
Euclidean distance under $\phi$, assuming a neighbor search constrained to fall within matching
clusters. If $c_2 \in$ MMDP(2) has been matched to $c_1 \in$ MMDP(1), then

$$\eta(s) = \arg\min_{s' \in c_1} \| N(s') - N(\phi(s)) \|_2, \quad s \in c_2,$$

where $N(\cdot)$ is the same coordinate-wise range normalization used in the state clustering
steps above for MMDP(1). Given the fine scale state mapping $\eta$, the optimal fine policy
for the default problem MMDP(1) was transferred to MMDP(2) by transferring separately
within matched clusters (that is, between matched sub-tasks).
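The correspondence $\eta$ and the resulting policy transfer for one pair of matched clusters can be sketched as follows. The affine form of $f$, the helper names, and the choice to build the normalization from the states passed in are assumptions made for illustration; the text only states that one $x$ interval was mapped onto the other and that $g$ and $h$ are the identity.

```python
import math


def affine_map(src, dst):
    """Affine map sending the interval src = (a, b) onto dst = (c, d)."""
    (a, b), (c, d) = src, dst
    return lambda x: c + (x - a) * (d - c) / (b - a)


def phi(state, f):
    """State mapping (x, theta, theta_dot) -> (f(x), theta, theta_dot)."""
    x, theta, theta_dot = state
    return (f(x), theta, theta_dot)


def make_normalizer(states):
    """Coordinate-wise range normalization N(.) fitted to the given states."""
    lo = [min(c) for c in zip(*states)]
    hi = [max(c) for c in zip(*states)]
    span = [h - l if h > l else 1.0 for l, h in zip(lo, hi)]
    return lambda s: tuple((v - l) / w for v, l, w in zip(s, lo, span))


def eta(s, f, cluster1_states, normalize):
    """Index of the state in the matched source cluster nearest to phi(s)."""
    target = normalize(phi(s, f))
    return min(range(len(cluster1_states)),
               key=lambda i: math.dist(normalize(cluster1_states[i]), target))


def transfer_policy(cluster2_states, cluster1_states, policy1, f, normalize):
    """Give each destination state the action of its nearest matched source state."""
    return [policy1[eta(s, f, cluster1_states, normalize)] for s in cluster2_states]
```

Restricting the search to `cluster1_states` implements the constraint that neighbors must fall within matching clusters; calling `transfer_policy` once per matched pair transfers the policy sub-task by sub-task.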



17. In this simple example, we illustrate the core ideas by carrying out only fine scale policy transfer into 
the finest scale of a hierarchy; thus, only the destination problem needs to be compressed. For general 
transfer into arbitrary levels of a hierarchy, compression of both problems would be required. 

18. In more complex problems where this may not be as obvious, the procedure described in Section 4.3 may
be applied.
Figure 10: Solution of the pendulum transfer problem described in Section 5.2: comparison among
multiscale algorithms both with and without fine scale policy transfer, and comparison
to policy iteration. Panels: (a) value function error with transfer, diffusion initial compression
policy; (b) value function error with no transfer, diffusion initial compression policy;
(c) value function error, MS-oc transfer vs. no-transfer; (d) value function error, policy
iteration transfer vs. no-transfer. See text for details.

Transfer Problem Solution: We applied several multiscale algorithms listed in Table 5 (see
surrounding text for details describing these algorithms and notation) to explore the impact
of the transferred policy. Experiments with transfer and without are compared, and we
compare to ordinary policy iteration. Where transfer was considered, the initial fine scale
policy was the transferred policy. The default, fine-scale initial policy for experiments
without transfer is an arbitrary deterministic policy that always chooses the action which
applies no force to the cart. In all multiscale experiments, the initial coarse value function
was the value function solving a coarse MDP resulting from compression with respect to
the diffusion policy. The blending parameter appearing in policy updates was set to $\lambda = 1$.



In Figure 10 we show plots detailing the Euclidean distance (y-axis, "Error") between the
value function after t iterations (x-axis) of the given algorithm and the optimal fine value
function for MMDP(2). Figures 10(a) and 10(b) show the performance of the multiscale
algorithms as well as policy iteration ("PI"), respectively with and without fine scale policy
transfer. Here, error is plotted on a logarithmic scale. Comparing these two plots at t = 1
iteration, transfer provides a warm start for all algorithms. Furthermore, the improvement
seen in Figure 10(a) can be entirely attributed to transfer, since the multiscale algorithms
converge as fast as or slower than policy iteration, and do not contribute any improvement
in convergence rate in and of themselves for this problem. (Of course, the complexity of each
iteration is still much smaller for the multiscale algorithms compared to global algorithms
such as policy iteration.)



From Figures 10(a) and 10(b) it is also evident that MS-oc is the single best multiscale
algorithm, and has nearly the same rate of convergence as policy iteration (we note that after
t = 3 iterations the relative error has decreased to < 1%, and the problem is nearly solved).
Figure 10(c) gives a more detailed view of the improvement transfer confers in the context of
MS-oc. Error is plotted on a linear scale to make the difference more visible. Figure 10(d)
compares policy iteration with and without the transferred policy as the starting point.
Comparing Figure 10(c) to 10(d), the improvement due to transfer as well as the convergence
rates are seen to be similar.



Algorithms involving recompression, MS-cr and MS-or, do not converge to optimality
for this problem, and MS-oo and MS-co converge very slowly. That the MS-oo and MS-co algorithms
converge slowly relative to MS-oc confirms the importance of the bottleneck near the origin,
and by extension, the updates at this bottleneck. The fact that MS-cc converges more
slowly than MS-oc suggests that either the coarse or the fine initial data contain some errors.
Forcing too much of the initial coarse value function or fine scale policy results in poorer 
performance here. For this domain, however, we found that compression with respect to 
a pool of policies (Section 3.3.1, simulations not shown) does not yield an initial coarse 
value function giving any better performance than the coarse value function derived from 
compression with respect to the diffusion policy. This is likely due to the fact that there 
are few non-absorbing bottlenecks, and only the bottleneck near the origin evidently plays 
a key role; the gradients obtained from either coarse value function contain comparable 
information in this case. 
5.3 Playroom Domain

We consider a simplified version of the playroom domain introduced in Singh et al. (2005);
Barto et al. (2004). In our formulation, an agent interacts with four objects in a room: a
ball, a bell, a music button and a light switch. The actions available to the agent are:



1. Look at a randomly selected object. (Succeeds with probability 1). 

2. Place a marker on the object the agent is looking at. 

3. Press the music button. 

4. Kick the ball towards the marker. 

5. Flip the light switch. 



All actions except the first succeed with probability 0.75. In order to take the latter three 
actions, the agent must first be looking at the relevant object unless otherwise noted (see 
modifications below). If the ball is kicked into the bell, the bell rings for exactly one time 
period. If the light switch is flipped to the on position, the light stays on for exactly one 
time period, and then switches to the off state. The state is 5-dimensional, and consists of 
the following variables: 

1. Object the agent is looking at. 

2. Object the marker is currently placed on. 

3. Music on/off.

4. Bell on/off.

5. Light on/off. 



We will consider two pairs of problems within the playroom domain to illustrate com- 
pression and transfer. Each pair consists of a baseline task and a variation task. For a 
pair of tasks, the rules governing the environment will remain fixed; however, the goal of
the tasks will change. The objective is thus to apply knowledge gained from solving one 
problem in the environment towards solving another problem in the same environment. 

To build MDP models, the tasks were independently simulated for 1000 episodes of max- 
imum length 1000 actions. Each trial was ended upon reaching either the goal state, or the 
maximum number of actions. Since the statespaces are small, samples were simply binned 
according to the underlying state variables. Transition probabilities were then estimated 
empirically from the samples. Rewards were set to +10 for transitions to the goal state, and 
to −1 for all other transitions. In all cases, we fixed the discount parameter to $\gamma = 0.96$.
Next, the spectral clustering procedure described in Section 3.1 was applied, stopping after
a single iteration. Note that one iteration of Algorithm 1 can potentially result in more
than two parts, since a single cut can produce multiple disconnected subgraphs. We next
compressed the tasks once, assuming the diffusion policy at the fine scale. For the baseline
tasks from which information is transferred, the MDP hierarchies were solved to the optimal
solution following Algorithm 2.
5.3.1 Coarse Transfer Example

In this example we illustrate the transfer of a coarse scale potential operator from one task
to another. We will refer to the problem supplying the potential operator as the default 
problem, and the problem into which we transfer information as the transfer problem. 
The default problem is assumed to be solved, in the sense that coarse MDPs have been 
compressed with respect to optimal policies, and the policy used to define the potential 
operator is optimal. 

For the pair of playroom problems we will explore in this section, the light is turned
on for one period by taking the "flip the light switch" action with the marker on the bell,
while looking at the ball. These are the same conditions for ringing the bell, only now the
agent can alternatively flip the light on. The tasks are as follows:

Default (MMDP(1)): The goal of the default task is to cause the bell to ring while the music
is playing. Note that there is no action that will directly ring the bell. The agent 
must: look at the music button, press the music button, look at the bell, place the 
marker on the bell, look at the ball, and finally, kick the ball into the bell. Because 
the bell only rings for a single period, the music must be turned on before ringing 
the bell. In this task, the light switch does not play a role. Each episode begins with 
the agent looking at a random object, the marker on a random object, and all on/off 
objects in the off state. 

Transfer (MMDP(2)): The goal is to flip the light switch to the on position while the music
is playing. However, to reach the goal state the agent must still have the marker on
the bell, and must be looking at the ball in order to take the "flip light switch" action.
That is, the roles of the light switch and ball kicking actions in solving the problem
have been swapped. The agent must: look at the music button, turn on the music,
look at the bell, place the marker on the bell, look at the ball, flip the light switch.

The difference between the default and transfer tasks is that the final action leading to the 
goal state has been switched and the underlying goal state itself has changed. 
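The two goals and the reward rule used when building the playroom models can be stated in a few lines. This is a sketch only: the `PlayroomState` encoding and the function names are hypothetical, while the goal conditions and the +10 / −1 rewards are those given in the text.

```python
from typing import NamedTuple


class PlayroomState(NamedTuple):
    look: str      # object the agent is looking at
    marker: str    # object the marker is currently placed on
    music: bool    # music on/off
    bell: bool     # bell ringing on/off
    light: bool    # light on/off


def is_goal(task, s):
    """Goal test for the default and transfer tasks of this example."""
    if task == "default":
        return s.music and s.bell      # bell rings while the music is playing
    if task == "transfer":
        return s.music and s.light     # light is on while the music is playing
    raise ValueError(f"unknown task: {task}")


def reward(task, next_state):
    """+10 for a transition into the goal state, -1 for any other transition."""
    return 10.0 if is_goal(task, next_state) else -1.0
```

The environment dynamics are identical across the pair; only `is_goal`, and therefore the reward, changes between the two tasks.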



Figure 11 shows a 2D diffusion map (Coifman et al., 2005) visualization of the statespace
graphs for each of the two tasks. Even with two coordinates, the graphs appear nearly
identical so that statespace graph matching should not be difficult. Bottlenecks are marked
by 'x', ordinary states with 'o', and goal (terminal) states are boxed. Non-bottleneck
states are colored according to membership in one of the two possible identified clusters.
Edges connect pairs of states for which there is a non-zero transition probability given
some action. Although state embeddings may be similar, the underlying states can be very
different. As can be seen from the plots, the goal states in particular have similar diffusion
map coordinates, but of course represent different underlying states of the environment.
For both tasks, spectral clustering resulted in two clusters and seven bottlenecks.
To demonstrate coarse transfer from the default task to the transfer task, we will:

1. Match bottlenecks across problems by coarse scale statespace graph matching, following
Section 4.6.

2. Compute values for the transfer task's bottlenecks by transferring the default task's
potential operator following Section 4.4, along the coarse statespace correspondence
determined in Step (1). For this problem, the coarse action mapping determined
following Section 4.3 is simply the canonical correspondence defined by executing the
fine scale policy in a designated cluster. If action a in MMDP(1) means "execute
the fine policy in cluster c", then this action is mapped to an action a' in MMDP(2)
corresponding to "execute the fine policy in cluster c'", if cluster c' is matched to
cluster c following Section 4.2.

3. Push the coarse solution down to solve the transfer task at the fine scale following
multiscale Algorithm 2 and variants thereof listed in Table 5.

Figure 11: Diffusion map embeddings of the statespaces for the playroom coarse transfer example
of Section 5.3.1: default (left), and transfer (right) tasks. See text for details.

Bottleneck Matching: Any graph matching algorithm may be used. We used the procedure
described in Section 4.6, together with the matching algorithm of Huang and Jebara (2011)
(using their freely available MATLAB implementation), to confirm that the matching for
this problem can be done easily. The bottlenecks and their correspondences were as follows
(matched bottlenecks are listed on the same row; each state is written as
(look, marker, music, bell, light)):

Default                             Transfer
(ball, bell, on, on, off)           (ball, bell, on, off, on)
(music, ball, on, off, off)         (music, ball, on, off, off)
(music, music, off, off, off)       (music, music, off, off, off)
(music, bell, on, off, off)         (music, bell, on, off, off)
(music, light, on, off, off)        (music, light, on, off, off)
(bell, bell, off, off, off)         (bell, bell, off, off, off)
(bell, bell, on, off, off)          (bell, bell, on, off, off)

The goal states in each problem (top row) are successfully paired, while the rest of the
bottlenecks are identical across tasks.
Transfer Problem Solution: We evaluated several multiscale algorithms listed in Table 5 (see
surrounding text for a description). For all algorithms, the blending parameter appearing
in policy updates was set to $\lambda = 1$. In all non-transfer experiments, the initial coarse
scale value function was obtained by solving the coarse MDP given by compression with
respect to the diffusion policy. The diffusion policy was chosen over the pool method
in Section 3.3.1 for simplicity, so that actions across the two problems could be placed
into a natural correspondence, and so that error due to the action mapping would not be
conflated with other sources of error. In all experiments, the initial fine scale policy was
chosen arbitrarily to be the (deterministic) policy which always takes the look action.



Figure 12 compares performance among multiscale algorithms and to the canonical policy
iteration method. As before, vertical axes labeled "Error" show the Euclidean distance
between the value function after t iterations (x-axis) of the given algorithm, and the optimal
value function for this problem. Inset plots detail boxed regions. See Table 5 for a
description of the algorithms' labels.



Figures 12(a) and 12(b) show the performance of various multiscale algorithms with and
without fine scale policy transfer, respectively. Comparing the two plots, transfer provides a
good warm start (lower starting error), and also affords faster convergence (fewer iterations)
for all multiscale algorithms. For transfer experiments (Figure 12(a)), the coarse initial value
function comes from transfer, and we may reasonably assume it is reliable. For this reason,
algorithms which leverage the initial coarse information as far as possible, and iterate inside
clusters to convergence before updating the boundary ("MS-{cc, co, cr}" traces), perform
better than the algorithms that only iterate over interiors once between boundary updates
("MS-{oo, oc, or}" traces). In contrast, for the experiments which did not involve transfer
(Figure 12(b)), the MS-{cc, co, cr} algorithms exhibit slower convergence than the
MS-{oo, oc, or} family. Since the initial coarse value function was derived from a fine scale
diffusion policy in the no-transfer setting, we can conclude, as one would expect, that the
initial coarse estimate was not entirely reliable. When there is potential operator transfer,
algorithms MS-{cc, co, cr} are equally good for solving the problem, and MS-cc is the
best algorithm not involving recompression. In the absence of transfer, MS-or performs
best and converges faster than policy iteration, while MS-oc is the best algorithm not
involving recompression. Although in this example the recompression algorithms
("MS-{or, cr}" traces) do not converge to the optimal value function, the sequence of policies does
converge to the optimal policy. All of the multiscale algorithms reach optimal policies in
fewer iterations than policy iteration in the case of transfer. When there is no transfer of
information, and the initial coarse scale data is unreliable, then the multiscale algorithms
not involving recompression can take more iterations to converge as compared to policy
iteration. However, as mentioned previously, each iteration of the multiscale algorithms is
substantially faster (involving local computations) than iterations of policy iteration, which
is a global algorithm (see Section 5.1 for a discussion regarding this point).



In Figures [12(c) and |12(d')] we compare algorithms MS-cc and MS-oc, with and without 



transfer, to the policy iteration algorithm on a linear scale. Policy iteration in all cases starts 
from the same initial fine scale policy as the multiscale algorithms. These plots more clearly 
demonstrate the advantage afforded by the coarse scale transfer: traces labeled "Transfer" 
confirm that there is both a warm start (lower starting error) and faster convergence (fewer 
iterations). The effect is also more pronounced in the case of MS-cc, since this algorithm
maximally leverages the transferred information. 
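
The two effects discussed here, a warm start and fewer iterations, can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: it uses a hypothetical eight-state chain MDP (not the playroom domain), plain policy iteration (not the multiscale algorithms), and a partially correct initial policy standing in for transferred information. The error is the Euclidean distance to the optimal value function, recorded once per iteration.

```python
import math

GAMMA = 0.9
N = 8                 # hypothetical chain: states 0..7, state 7 is the goal
LEFT, RIGHT = 0, 1

def step(s, a):
    """Deterministic chain dynamics: returns (next_state, reward)."""
    if s == N - 1:                       # the goal is absorbing and free
        return s, 0.0
    s2 = max(s - 1, 0) if a == LEFT else s + 1
    return s2, -1.0                      # unit cost per move

def evaluate(policy, sweeps=500):
    """Iterative policy evaluation with a fixed number of sweeps."""
    v = [0.0] * N
    for _ in range(sweeps):
        for s in range(N):
            s2, r = step(s, policy[s])
            v[s] = r + GAMMA * v[s2]
    return v

def greedy(v, policy):
    """Greedy improvement; the current action is kept on (near) ties."""
    new = []
    for s in range(N):
        q = [r + GAMMA * v[s2]
             for s2, r in (step(s, a) for a in (LEFT, RIGHT))]
        keep = q[policy[s]] >= max(q) - 1e-12
        new.append(policy[s] if keep else q.index(max(q)))
    return new

def policy_iteration(policy, v_star=None):
    """Returns (policy, values, per-iteration error relative to v_star)."""
    errors = []
    while True:
        v = evaluate(policy)
        if v_star is not None:
            errors.append(math.dist(v, v_star))
        improved = greedy(v, policy)
        if improved == policy:
            return policy, v, errors
        policy = improved

# Reference solution, then a cold start and a partially correct warm start.
_, v_star, _ = policy_iteration([LEFT] * N)
_, _, cold_err = policy_iteration([LEFT] * N, v_star)
_, _, warm_err = policy_iteration([LEFT] * 4 + [RIGHT] * 4, v_star)
```

In this toy problem the warm-started trace begins at a lower error and terminates after fewer evaluations than the cold-started one, which is the qualitative pattern the "Transfer" traces exhibit; the magnitudes have no bearing on the playroom results.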

5.3.2 Partial Policy Transfer Example 

This example illustrates partial transfer of a policy at the fine scale of a two scale hierarchy. 
The pair of problems in this section differ from the previous section only in that to turn 
on the light, the agent must be looking at the light switch and the marker must be on the 
bell. Now, turning on the light differs from ringing the bell by two actions. The tasks are 
as follows: 

Default (MMDP(1)): The goal is to cause the bell to ring while the music is playing. The
agent must look at the music button, press the music button, look at the bell, place
the marker on the bell, look at the ball, and finally, kick the ball into the bell. The
light switch does not play a role.

Transfer (MMDP(2)): The goal is to flip the light switch to the on position while the music
is playing. The agent must look at the music button, turn on the music, look at the
bell, place the marker on the bell, look at the light switch, and flip the light switch.

As before, each episode begins with the agent looking at a random object, the marker on a 
random object, and all on/off objects in the off state. 

Although this pair of problems involves only one additional action change in the sequence
leading up to the goal as compared to the previous section's pair, the detected bottlenecks at
the coarse scales cannot be easily matched. Figure 13 shows 2D diffusion map visualizations
of the statespace graphs for the two tasks described in this section. Again, bottlenecks are
marked by 'x', ordinary states by 'o', and goal states are boxed. In contrast to the
corresponding figure for the previous example, it can be seen here, comparing the default
task to the transfer task, that some bottlenecks become interior states and vice versa. Thus,
direct matching of the coarse scale statespaces (assuming a two layer compression hierarchy)
and subsequent coarse scale policy transfer is not an immediate possibility here.

What we might hope, however, is that we can transfer the portion of the fine scale policy
dealing with states in which the music is off. It is only after the music is on that the optimal
action sequences for the two tasks diverge, and when the music is off the immediate sub-goal
for both tasks is to turn it on. Figure 13 confirms this possibility: interior states are colored
according to the spectral partition as before, and the clusters in this case correspond to
"music ON" vs. "music OFF" states.$^{19}$ As can be seen from the plots, the sign of the
(Fiedler) eigenvector $\varphi_2$ gives this partitioning. The procedures described in Section 4 were
next applied to the current pair of tasks in order to detect transferability and effect policy
transfer.
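
As a small illustration of partitioning by the sign of the Fiedler eigenvector, the sketch below bisects a toy graph. It rests on several assumptions that are not taken from the paper: the graph is a hypothetical six-node example (two triangles joined by one edge), the operator is the combinatorial Laplacian $L = D - A$ rather than whatever normalization the diffusion map construction uses, and the eigenvector is approximated by power iteration on $cI - L$ with the constant vector projected out.

```python
import math

def fiedler_vector(adj, iters=500):
    """Approximate the Fiedler vector of L = D - A by power iteration.

    adj is a symmetric 0/1 adjacency matrix of a connected graph.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    c = 2.0 * max(deg)                  # c bounds the largest eigenvalue of L
    x = [float(i) for i in range(n)]    # arbitrary deterministic start
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]     # remove the constant eigenvector
        # y = (c*I - L) x = (c - deg) * x + A x
        y = [(c - deg[i]) * x[i] + sum(adj[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

phi2 = fiedler_vector(adj)
clusters = [0 if v < 0 else 1 for v in phi2]   # partition by sign
```

The sign pattern of `phi2` separates the two triangles, with the joining edge playing the role of a bottleneck. For larger graphs a sparse eigensolver would be the natural choice; power iteration is used here only to keep the sketch free of dependencies.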



Cluster Correspondence: The cluster correspondence step (Section 4.2) correctly paired 
together "music OFF" and "music ON" clusters, respectively. The pairwise cluster distances 
were found to be: 



19. Of course, one does not need to know or identify what the clusters mean in order to transfer something; 
we provide this explanation for illustrative purposes. 
Figure 12: Solution of the playroom coarse transfer problem described in Section 5.3.1. Comparison
among multiscale algorithms both with and without coarse scale potential operator
transfer, and comparison to policy iteration. Panels: (a) value function error, coarse
transfer; (b) value function error, no transfer, diffusion initial compression policy;
(c) value function error, MS-cc vs. policy iteration; (d) value function error, MS-oc vs.
policy iteration. Horizontal axes: iterations. Traces in (a) and (b): MS-oo, MS-oc, MS-cc,
MS-co, MS-cr, MS-or, PI. See text for details.
Figure 13: Diffusion map embeddings of the statespaces for the partial playroom transfer example
of Section 5.3.2: default (left), and transfer (right) tasks. See text for details.





                      MMDP(2): Cluster 1    MMDP(2): Cluster 2
MMDP(1): Cluster 1         0.2850                0.4115
MMDP(1): Cluster 2         0.3583                0.3201



Here, Cluster 1 is the "music OFF" cluster and Cluster 2 is the "music ON" cluster. 
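
One simple way to read a correspondence off such a table of pairwise distances is to choose the pairing with the smallest total distance. The sketch below does this by brute force over permutations. It is an assumption for illustration and not necessarily the procedure of Section 4.2; the matrix layout (rows indexed by MMDP(1) clusters, columns by MMDP(2) clusters) is likewise assumed. For the four distances reported above, the identity pairing is selected under either orientation of the table.

```python
from itertools import permutations

def match_clusters(dist):
    """Pair row cluster i with column cluster p[i], minimizing total distance.

    Brute force over permutations, adequate for a handful of clusters.
    """
    n = len(dist)
    best = min(permutations(range(n)),
               key=lambda p: sum(dist[i][p[i]] for i in range(n)))
    return list(best)

# Rows: MMDP(1) clusters 1 and 2; columns: MMDP(2) clusters 1 and 2 (assumed).
dist = [[0.2850, 0.4115],
        [0.3583, 0.3201]]
pairing = match_clusters(dist)
```

Here `pairing[i]` is the index of the MMDP(2) cluster matched to MMDP(1) cluster `i`, so "music OFF" is matched to "music OFF" and "music ON" to "music ON".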



Detecting Transferability: Next, the transfer detection algorithm described in Section 4
was applied separately to the pairs (Cluster 1 ∈ MMDP(1), Cluster 1 ∈ MMDP(2)) and
(Cluster 2 ∈ MMDP(1), Cluster 2 ∈ MMDP(2)), following the cluster correspondence above.
No statespace graph matching was done for this example, so we are effectively assuming
that the roles of the states in each problem are the same. Assessing transferability is
therefore important in order to determine if and where this assumption might hold.
Figure 14 shows the value functions calculated using MMDP(1)'s optimal policy (green box
traces, labeled $V_0$ for scale $j = 0$) and the diffusion policy (blue circle traces, labeled $V_0$).
States inside MMDP(1), MMDP(2) cluster intersection regions are plotted with small open
points, and large filled points identify all other states. The states (horizontal axis) are
ordered according to the magnitude of $V_0$ for improved readability. The left-hand plot shows
values for states in Cluster 1 ("music OFF"). The transferred policy clearly leads to more
expected reward everywhere, and the two value functions follow a similar general trend.



The right-hand plot in Figure 14 shows value functions on the states in Cluster 2
("music ON"). Here there is large disagreement on several states, suggesting that applying the
optimal policy from MMDP(1) to MMDP(2) Cluster 2 could be problematic. This is
not surprising considering that the goal states for both tasks are either inside or connected
to the respective problem's Cluster 2. Indeed, the test given by Equation (22) produces
$T = -1.31$ in Cluster 2, while $T = +6.64$ in Cluster 1. We conclude that transfer in Cluster
1 should be attempted, but transfer in Cluster 2 should not be attempted in the absence of
a better statespace mapping.

Figure 14: Transfer detection. Value functions for Cluster 1 (left) and Cluster 2 (right), shown over
the entire respective cluster interiors. States inside intersection regions are plotted with
small open points, and large filled points identify all other states. The value functions
were obtained using MMDP(1)'s optimal policy following Section 4, and are plotted in
ascending sorted order according to the magnitude of $V_0$. See text for details.



Policy Transfer: With confirmation that transfer within Cluster 1 could be helpful, the
optimal policy from MMDP(1) was transferred to the overlapping interior MMDP(2) Cluster
1 states following the process described in Section 4.3. As mentioned earlier, the identity
correspondence between the relevant underlying states in each task was assumed, and the
actions mapped accordingly. For this particular problem the actions do not change, but we
followed the general mapping process anyhow since one does not generally know a priori
whether actions need to be changed, or whether the representation of the two problems is
such that actions are known by the same labels or not. The initial policy, post transfer, as
well as the optimal policy on the interior of MMDP(2) Cluster 1 were as follows:



Initial    1  1  1  1  1  ?  ?  ?  2  2  2  1  1  1  1  1
Optimal    1  1  1  1  1  3  3  3  2  2  2  1  1  1  1  1



where 1 means "look at a random object", 2 means "place the marker" and 3 means "press
music button" (see action definitions at the top of Section 5.3). Question marks in the
initial policy indicate states which did not receive a policy from MMDP(1). Cluster 1 in
MMDP(1) contained 13 interior states, while Cluster 1 in MMDP(2) contained 16; thus the
maximum of 13 policy entries were transferred to MMDP(2). Unknown states are given a
default guess below. As can be seen in the table above, for this task the transferred policy
entries correctly matched the optimal policy for MMDP(2).
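
The bookkeeping of this partial transfer can be sketched as follows. Everything named here is hypothetical: the function `transfer_policy`, the integer state labels, the position of the three unmatched states, and the default action are assumptions chosen to mirror the 13-of-16 situation described above, not the procedure of Section 4.3 itself.

```python
def transfer_policy(source_policy, target_states, state_map, action_map,
                    default_action):
    """Build an initial target policy from a source policy.

    state_map sends a target state to its source counterpart (if any);
    action_map translates source action labels into target action labels.
    States with no counterpart receive default_action.
    """
    initial, transferred = {}, set()
    for t in target_states:
        s = state_map.get(t)
        if s is not None and s in source_policy:
            initial[t] = action_map[source_policy[s]]
            transferred.add(t)
        else:
            initial[t] = default_action
    return initial, transferred

# Optimal policy on the 16 interior states of the target cluster, as listed.
optimal = [1, 1, 1, 1, 1, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1]
target_states = list(range(16))
unmatched = {5, 6, 7}                      # assumed: states absent from the source

# Source policy on the 13 shared states; identity state and action maps.
source_policy = {s: optimal[s] for s in target_states if s not in unmatched}
state_map = {s: s for s in source_policy}
action_map = {1: 1, 2: 2, 3: 3}

initial, transferred = transfer_policy(source_policy, target_states,
                                       state_map, action_map, default_action=1)
shown = " ".join(str(initial[t]) if t in transferred else "?"
                 for t in target_states)
```

The string `shown` reproduces the "Initial" row, with question marks at the states that received no transferred entry, while `initial` holds the policy actually handed to the solver, with the default guess filled in.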
Figure 15: Value function errors for the playroom partial transfer example of Section 5.3.2:
comparison between policy iteration, and multiscale algorithms both with and without
compression. Panels show value function error, transfer vs. no-transfer, for: (a) pool,
recompression; (b) pool, no recompression; (c) see text; (d) diffusion, recompression;
(e) diffusion, no recompression. See text for details.
Transfer Problem Solution: Figure 15 compares performance both with and without partial
fine-scale policy transfer, across several different solution algorithms. The four curves in
each plot correspond to different initial conditions and/or solution algorithms, and give the
Euclidean distance ("Error", vertical axis) between intermediate value functions computed
after t iterations (horizontal axis) of a given algorithm, and the true, optimal value function
for MMDP(2). Traces labeled with the prefix "PI: No-Transfer" correspond to vanilla policy
iteration on the global statespace, with no transfer information, while the traces labeled
"PI: Transfer" correspond to policy iteration starting from the transferred fine scale policy.
The "PI: No-Transfer" curve is the same in all plots, and the "PI: Transfer" curve is the
same in all plots except Figure 15(c). Traces labeled "MS: Transfer" and "MS: No-transfer"
refer to different multiscale algorithms appearing in Table 5, with and without initial coarse
solutions obtained following Section 3.3.1, and with and without fine-scale policy transfer
(respectively). In the transfer experiments, the transferred policy served as the initial policy.
In all cases, the blending parameter was set to $\lambda = 1$ (no blending), giving a purely greedy
policy update. Arrows in the plots mark the point at which the policy has converged to the
optimal (fine) policy.$^{20}$ The particular multiscale algorithms and conditions we tested in
each plot are as follows:



Figure          Initial Compression    Multiscale Algorithm
Figure 15(a)    pool                   cr
Figure 15(b)    pool                   cc
Figure 15(c)    pool                   cr, cc
Figure 15(d)    diffusion              or
Figure 15(e)    diffusion              oc



See Table 5 and surrounding discussion for a description of the algorithms. The "Initial
Compression" column above specifies whether the initial coarse value function solved a
coarse MDP compressed with respect to a collection of policies as described in Section 3.3.1
("pool"), or with respect to the diffusion policy ("diffusion"). We assume that the initial
coarse value function under the pool condition is trustworthy, and choose multiscale
algorithms which iterate within cluster interiors to convergence before updating the boundary.
For initial coarse value functions derived from the diffusion policy, we assume there could
be errors and opt for multiscale algorithms which only update cluster interiors once before
each boundary update.

Several conclusions (specific to this problem domain) may be drawn from these experi- 
ments: 

1. The impact of the transferred policy is essentially only noticeable when used in
conjunction with a good initial coarse guess (Figures 15(a), 15(b)). Both algorithms
(recompression/no-recompression) give similar performance. The recompression-based 
algorithm does not converge to the optimal value function, although the corresponding 
policy sequence does converge to the optimal policy. 

2. For the canonical policy iteration algorithm, using the transferred policy as the initial
condition gives only a slight advantage. However, policy iteration is far less robust to
errors in the transferred policy than the family of multiscale algorithms. Figure 15(c)
shows the result of transferring the entire fine policy for MMDP(1), even in the cluster
where transfer detection suggested transfer could be error prone. The multiscale
algorithm with recompression is labeled "MS-c", and the one without recompression "MS-nc".
For this particular problem, the multiscale algorithms are tolerant of these errors, and
convergence is in fact faster. (We emphasize, however, that this may not at all be
true for other problems.) Policy iteration, however, suffers, and takes additional time
to correct errors in the second (error-prone) cluster's policy.

20. This is clearly not detectable in practice, but is informative in the context of recompression-based
algorithms which do not converge to the optimal value function in this example.

3. When the diffusion policy is used for compression (Figures 15(d), 15(e)), transfer has
little impact. Furthermore, recompression during the solution process is necessary to
quickly correct errors imposed by a poor initial coarse value function. The algorithm
involving local bottleneck updates (Figure 15(e)) requires more iterations to converge
to optimality, as compared to the other algorithms.

4. Ignoring cost per iteration, most of the improvement of the multiscale algorithms
over policy iteration comes from the algorithms themselves rather than the transferred
information. However, in large complex domains where clusters may themselves be 
complex tasks, even small transfer improvements may lead to substantial savings. 

6. Related Work 

Our work has many points of contact with the literature, and we do not attempt a compre- 
hensive comparison. We highlight the main, most important similarities and differences. 

There are several overarching themes which distinguish our work from much of the 
literature: 

• Multiscale structure: Multiscale is a unifying, organizational principle in our work.
Our approach enforces a strong multiscale decomposition of tasks into subtasks, such
that each scale may be treated independently of the others. Hierarchies of arbitrary
depth may be easily constructed. Many approaches ultimately require some form of
"flattening" (see for instance HAMs (Parr and Russell, 1998), options (Sutton et al.,
1999)), or do not generalize well beyond a single layer of abstraction.



• Multiscale consistency: Coarse scales are consistent with finer scales "in the mean",
and each scale is a separate MDP. Semi-Markov decision processes (SMDPs)
(Puterman, 1994), for example, do not share this notion of consistency.



• Computational efficiency: The multiscale structure we impose localizes computation 
and improves conditioning. The computational complexity of learning and planning 
can be significantly reduced, both in time and in space. 

• Coupling between learning, planning, and structure discovery: Our approach combines 
learning of macro-actions, multiscale planning, and inference of multiscale structure 
in a fundamental way. Many existing approaches focus on one or the other, resulting 
in a disconnect that leads to inefficiency and unresolved challenges. 
• Transfer: MMDPs support systematic, scale-independent transfer of knowledge between
tasks. Knowledge may be exchanged in the form of potential operators, policies,
or value functions. 

• Generality: Different statespace partitioning and bottleneck detection algorithms may 
be used. Compression may be carried out with respect to any policy, or collection 
of locally defined policies. Different value function representations and off the shelf 
algorithms for solving MDPs may be chosen. Key MMDP quantities may be computed 
analytically (if the model or an estimate of the model is known), or by Monte Carlo 
simulation. We do not assume specific choices of algorithms where possible. On the 
other hand, MMDPs are more constrained than SMDPs. SMDPs are very general 
objects, but this generality comes at the expense of conceptual and computational 
complexity. 

A more detailed comparison to specific work in the literature follows below.

6.1 Hierarchical Reinforcement Learning

Empirically, "standard" approaches to learning within flat problem spaces are often slow, 
scale poorly, and do not lend themselves well to the inclusion of prior knowledge. The 
hierarchical reinforcement learning (HRL) literature (see Barto and Mahadevan (2003) for
a review, and Sutton et al. (1999); Dietterich (2000); Parr and Russell (1998) in particular)
has long sought to address these challenges by incorporating hierarchy into the domain
and into the learning process. The essential goal of the hierarchical learning literature is to
divide-and-conquer, paralleling similar strategies for coping with complexity found
throughout biology and neuroscience. The notion of state abstraction has been considered
extensively. In most cases, coarse or "macro" actions are broadly defined as temporally extended
sequences of primitive actions. The pioneering "options" framework (Sutton et al., 1999)
proposed a means to solve reinforcement learning problems, given pre-specified collections
of such macro-actions. The options framework is closely related to SMDPs (see Puterman,
1994; Das et al., 1999), and SMDPs have more generally become a modeling formalism of
choice for HRL. If a set of options has been pre-specified, the problem of learning an
optimal policy over a set of options is an SMDP. Many of the hierarchy discovery algorithms
surveyed below construct options, and then employ SMDP learning techniques. For this 
reason we devote special attention to the options framework, and discuss how it relates to 
the present work. 

6.1.1 Relation to Options and SMDPs 



In the options framework (Sutton et al., 1999), a hierarchical value function is used to define
a flat, global policy. One level of abstraction is typically considered: options are policies
accompanied by a specification as to when an option can be invoked ("initiation set"), and
when it should end, once triggered ("termination condition"). Some elements of our work
can be described in the language of options; however, there are some important differences
distinguishing our framework from that of options/SMDPs.

Options and SMDPs are general approaches, and some of our design decisions may 
be viewed as a specialization of certain aspects of options/SMDPs. Other aspects of our 
work may be viewed as more substantial departures from the options framework. These
differences confer computational, transfer and multiscale advantages, and promote learning 
of the macro-actions themselves: MMDPs are constructed specifically with these objectives 
in mind. The options framework does not consider learning of the options, and is primarily 
designed for planning with pre-specified macro-actions. This disconnect between learning 
and planning leads to inefficiencies, in terms of both computation and exploration, when 
SMDPs are combined with separate schemes for learning options. How to learn macro- 
actions within the options framework is a major challenge that has received considerable 
attention in the literature, yet there is no clear consensus as to how coarse rewards and 
discounts should be set locally in order to efficiently learn a macro-action that can be 
combined with others in a consistent way. Fundamentally, we take a holistic point of 
view, and couple learning of macro-actions and planning with macro-actions together. The 
hierarchical structure we consider is also tightly coupled with computation and conditioning. 
Solving for macro-actions takes advantage of improved local mixing times at finer scales 
and fast mixing globally at coarse scales. At any scale, learning is localized, in terms of 
space and computation, by way of the multiscale decomposition. The local learning is fast, 
and consistent globally, due to information feeding in from distant parts of the statespace 
through coarse scale solutions. Our framework also more easily accommodates multiscale 
representations of arbitrary depth. Each scale is an MDP and may be solved using any 
algorithm. In contrast to options, the introduction of additional scales does not necessarily 
add complexity to the planning phase (indeed, it usually reduces it). As we will discuss 
below, that each scale is an MDP also supports further transfer opportunities. 
We point out a few other salient commonalities and differences: 

• Our coarse actions, or "macro-actions", are temporally extended sequences of actions 
at the previous (finer) scale, and involve executing a policy within a particular clus- 
ter. In the language of options, the initiation set for a coarse action is any bottleneck 
connected to a cluster on which the action's policy is defined, and the policy ter- 
minates whenever a bottleneck is reached. Our coarse actions are always Markov 
(not semi-Markov), and the termination condition depends only on the current state. 
Furthermore, hierarchies of coarse actions do not lead to semi-Markov options in our
framework: they always remain Markov.$^{21}$

However, options may only direct the agent in one direction, and may terminate in the 
initiation set of at most one other successor option. To get around these limitations, 
new, separate options must be defined, increasing the problem's branching factor, and 
care must be taken to avoid loops (if so desired). An MMDP coarse action leaves the 
"direction" of the action undecided: the same fine policy may be executed starting 
in several bottleneck states, and may take the agent in one of several directions until 
arriving at one of multiple destinations from which different successor coarse actions 
may be taken. In the context of MMDPs, if one wanted to be able to transfer policies 
on the same cluster which guide an agent in different particular directions, separate 
local policies would need to be stored in the "database" of solved tasks. However, we 
would only need to transfer and plan with one of them. 

21. This is because the homogenization we prescribe results in deterministic quantities, and Markovianity 
would not necessarily be preserved if coarse variables were random. 
• A strength of the options framework is that multiple related queries, or tasks, may be
solved essentially within the same SMDP. However, the tasks must be closely related 
in specific ways (e.g. tasks differing only in the goal state), and this strength comes 
at the expense of ignoring problem-specific information when one only wants to solve 
one problem. Our approach to the construction of MMDPs differs in that while we 
assume a particular problem when building a decomposition, we are able to consider 
a broader set of transfer possibilities. 

• Bottlenecks and partitioning do not explicitly enter the picture in options or SMDPs.
Options may be defined on any subset of the statespace, and in applications may 
often take the form of a macro-action which directs the agent to an intermediate goal 
state starting from any state in a (possibly large) neighborhood. For example, an 
option may direct a robot to a hallway from any state in a room. We constrain our 
"initiation" and "termination" sets to be bottleneck states, however this means that 
learning policies at coarse scales is fast, and can be carried out completely independent 
of other scales. Coarse scale learning involves only the bottleneck states, giving a 
drastically reduced computational complexity. Provided the partitioning of a scale is 
well chosen, this construction allows one to capitalize on improved mixing times to 
accelerate convergence. 

• MMDPs are a representation for MDPs: we cannot solve problems that cannot be
phrased as an MDP (i.e. problems whose solutions require non-Markov policies). A 
policy solving an MMDP, at any scale, is a Markov policy. SMDPs may in general 
have non-Markov solutions (for example, policies which depend on which option is 
currently being executed). 

• Multiscaleness: The options framework is arguably, at its core, a flat method. In
general, options may reference other options, but essentially any and all options may
be made available in a given state. In order to plan with options, one needs to know
which option is best to execute, and at which scale, for the given task. To choose an
option at a given "scale", options at other scales must be ruled out. In this sense,
planning with options is "bottom-up", while our approach may be described as
"top-down". The bottom-up approach is potentially problematic for two reasons: (1) A
domain expert has to specify coarse policies. Determining a multiscale collection of
policies by hand can be difficult, if not intractable. (2) To learn a fine policy solving
the problem, all options across all scales potentially need to be considered at once.
This is accomplished by effectively flattening the hierarchy: a state-option Q-function
would need to have entries for every option which might be initiated from a given
state, across all scales. Scale is lost in the sense that the value at a state may only
be defined as the value at that state under a flat policy. Without significant user
guidance and tailoring of the SMDP, this cannot be avoided. Either the burden on
the user is high, or the computational burden is high.

Even if one repeatedly defines SMDPs on top of each other, a number of difficul- 
ties arise: (1) The resulting transition probability kernels and reward functions are 
not consistent across layers. For example, the transition kernel at a coarse scale 
is not the transition kernel of the embedded Markov chain observed only on initia- 
tion/termination/goal states. (2) An SMDP could have a disconnected statespace at 
the coarse layer, and it is not clear how this problem can be resolved. (3) The lack of
isolation of scales necessarily implies an increase in the number of actions, and thus 
an increase in the branching factor of the problem. 

For these reasons, the extension of options to multiple levels may not be easily carried 
out, and is not often seen in the literature. By contrast, our approach is strongly 
multiscale. We impose a specific, stronger form of hierarchy, that abstracts each scale 
away from the others. At a given scale, coarse actions may refer to fine actions, but 
only through the fixed multiscale organization, and they cannot invoke coarse actions 
at or above the current scale. The multiscale structure is always enforced. This leads 
to significant computational savings, and a form of consistency of problems and their 
solutions across scales. 



• Options/SMDPs define a coarse scale transition probability law which combines tran- 
sition probabilities between a macro's starting/ending states, the trajectory (sojourn) 
length distributions, and uniform discounting (multi-time model). This definition suf- 
fices if the options are user-specified, and one is only interested in a single layer of 
abstraction. The definition is problematic, however, for multi-layer hierarchies, non- 
uniform discounting, transfer learning, and learning of the options themselves. In 
our construction, coarse transition probabilities and discount factors are computed 
separately. One advantageous consequence of this is that path length distributions do 
not need to be estimated or represented explicitly. But it is also important to keep 
these quantities separate for the following reasons: (1) Transfer (detection, trans- 
fer of potential operators, and partial policy transfer); (2) One needs to modify the 
transition probability tensor in order to restrict to a local cluster, and solve local- 
ized sub-problems efficiently; (3) It is important to preserve multiscale consistency: 
a compressed MDP is an MDP that is consistent with the fine scale in the mean. 
This is not true of the quantities defined in the context of SMDPs. If there are non- 
uniform discounts, then it is the product of the discounts over trajectories that must 
be considered rather than a constant raised to the path length (see Section 3.2.5 for 
a discussion concerning this distinction). 



• Options/SMDPs define a coarse reward function which does not depend on the ter- 
mination state; aggregate rewards are pre-averaged over all possible ending states. In 
the context of learning and transfer, this choice can lead to serious errors. Consider 
the effect of averaging over paths ending at the starting state (small reward) with 
paths spanning a large cluster (large rewards). It is likely that with such a system 
of rewards, coarse solutions would contain little information for solving at the fine 
scale. In any event, this definition does not yield multiscale consistency in the sense 
discussed above. MMDPs keep track of the coarse reward for each possible starting 
and ending state, and these quantities are approximated by analytically computed
moments given a model or by Monte Carlo estimates. Space requirements are small,
however, because only bottlenecks, of which there are few, can be termination states 
for a macro-action. This convention also ensures multiscale consistency. 
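
The last two points, keeping discounts as products over trajectories and keeping rewards indexed by both the starting and the ending bottleneck, can be illustrated with a small Monte Carlo style estimator. This is a sketch under assumptions and not the paper's definitions (which are given in Section 3.2): the function name, the trajectory format, the fixed local policy, and the convention that the reward at step k is weighted by the product of the discounts of the preceding steps are all hypothetical choices.

```python
from collections import defaultdict

def coarse_estimates(trajectories):
    """Estimate coarse quantities for each (start, end) bottleneck pair.

    Each trajectory is (start, end, rewards, discounts), with one reward and
    one discount per fine-scale step, generated under one fixed local policy.
    """
    acc = defaultdict(lambda: {"count": 0, "discount": 0.0, "reward": 0.0})
    starts = defaultdict(int)
    for start, end, rewards, discounts in trajectories:
        weight, ret = 1.0, 0.0
        for r, g in zip(rewards, discounts):
            ret += weight * r       # reward weighted by the discounts so far
            weight *= g             # product of discounts, not gamma ** length
        cell = acc[(start, end)]
        cell["count"] += 1
        cell["discount"] += weight
        cell["reward"] += ret
        starts[start] += 1
    return {
        key: {
            "prob": c["count"] / starts[key[0]],      # coarse transition prob.
            "discount": c["discount"] / c["count"],   # mean discount product
            "reward": c["reward"] / c["count"],       # mean discounted reward
        }
        for key, c in acc.items()
    }

# Two sampled paths from bottleneck "A": a short loop back to "A" with a small
# reward, and a longer path across the cluster to "B" with larger rewards.
paths = [
    ("A", "A", [0.1], [0.9]),
    ("A", "B", [1.0, 1.0, 1.0], [0.9, 0.5, 0.9]),
]
est = coarse_estimates(paths)

# Pre-averaging over ending states, as in the options/SMDP reward model.
pre_averaged = sum(v["prob"] * v["reward"] for (s, _), v in est.items() if s == "A")
```

With these two paths the per-destination rewards are 0.1 and 2.35, whereas the pre-averaged value 1.225 describes neither outcome. Likewise, the discount product along the path to "B" is 0.405, which differs from 0.9 ** 3 = 0.729, the value a uniform discount raised to the path length would give.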
6.1.2 Other HRL Approaches



The MAXQ algorithm (Dietterich, 2000) is a method for learning a collection of policies 
at each layer of a programmer-specified hierarchy of subroutines, using a form of semi- 
Markov Q-learning. Two types of optimality are discussed: recursive optimality, where 
each sub-problem's policy is locally optimal, but the overall solution may not be optimal, 
and hierarchical optimality, where the global policy is optimal given the constraints the 
hierarchy imposes. We consider optimality with respect to the true, global optimum in the 
set of all stationary, Markov policies, and have discussed algorithms for solving MMDPs 
above which converge to optimal policies at the finest scale. More recent work (Kaelbling
and Lozano-Perez, 2011) has begun to explore hierarchy as a means for reducing the amount
of search that is required to learn and/or construct a hierarchical plan. In this paper, we do
not consider using a (partial) MMDP to speed up exploration and subsequent elaboration 
of itself, although this is an interesting avenue we leave for future work. 



Hierarchical learning in partially observable domains has also been considered (Theocharous
and Kaelbling, 2004; He et al., 2011; Pineau et al., 2003; Kurniawati et al., 2009), but is a
less developed topic. He et al. (2011) is noteworthy in that they consider online learning
with macro-actions in POMDPs, with a particular emphasis on scalability, although
macro-actions must be pre-specified by the user, and are constrained to be open-loop sequences.
Our coarse actions may also be seen as open-loop controllers, but they only end when a
bottleneck state is reached.



6.1.3 Hierarchy Discovery 

Hierarchy is important in the literature referenced above, but the meaning of the hierar- 
chy and its geometric interpretation is often detached from the solution process. If the 
hierarchy must be provided by a domain expert, the solution algorithms can only make 
limited assumptions about what the user has provided. In this paper, structure discovery 
and learning are intimately connected. Hierarchies are (automatically) defined based on 
the geometry and goals of the problem, and this is exploited to achieve locality of the com- 
putations and scalability, and to create opportunities for transfer. Much of the early HRL 
research primarily sought to define algorithms for learning given user-specified hierarchies, 
and only later did researchers consider automatic discovery and characterization of task 
hierarchies. As a result, many approaches to structure discovery appear to rest on top of 
generic HRL frameworks (such as options), and lack synergy with the underlying learning 
process. Within this collection of approaches, there are however several ideas which overlap 
with portions of our work. The literature concerning automatic detection of macro-actions 
and/or hierarchies can be roughly organized into three categories: (1) Approaches which 
aggregate states based on statistics observed at individual states during simulation, (2) 
approaches involving graph-based clustering/analysis, and (3) approaches based on "exper- 
imentation" or demonstration trajectories not necessarily directly related to the task to be 
solved. 



The work of Stolle and Precup (2002) recognizes a form of "bottleneck", there defined 
to be states which are visited frequently. The authors propose a heuristic algorithm which, 
given simulated trajectories, takes the top most frequently visited states as bottlenecks, and 
uses these states to define options. Others have attempted to capture similar properties. 
The HEXQ (Hengst, 2002) and VISA (Jonsson and Barto, 2006) algorithms group states
based on the frequency with which their values change. Marthi et al. (2007) perform a direct 
greedy search for hierarchical policies consistent with sample trajectories and an analysis of 
changes in states' values. However, a commonly occurring problem with these approaches 
is that they are computationally intensive and can require large trajectory samples. Explo- 
ration is global in these approaches, which could be problematic for large, complex domains. 
Both VISA and the HI-MAT approach of Mehta et al. (2008) (which automatically creates 
MAXQ hierarchies) require estimating and analyzing a dynamic Bayesian network in order 
to determine clusters of mutually relevant states. Approaches which assume DBN transition 
models allow for compact representations, but also lead to a solution cost that is exponen- 
tial in the size of the representation unless specialized approximate algorithms are used. 
In addition, some of these approaches do not maintain a principled or consistent notion of 
scale. 
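
As a rough illustration of the visitation-frequency heuristic attributed above to Stolle and Precup (2002), the following sketch ranks states by how many trajectories pass through them. The function name `frequent_states`, the choice to count each state once per trajectory, and the toy two-room trajectories are assumptions made for illustration, not details taken from that paper.

```python
from collections import Counter

def frequent_states(trajectories, k, exclude=()):
    """Return the k states that appear in the largest number of trajectories.

    Each state is counted at most once per trajectory, so that long dwell
    times inside a single episode do not dominate the ranking.
    """
    counts = Counter()
    for path in trajectories:
        counts.update(set(path) - set(exclude))
    return [state for state, _ in counts.most_common(k)]

# Hypothetical trajectories in a two-room domain joined by a doorway "door".
trajectories = [
    ["a1", "a2", "door", "b1", "goal"],
    ["a3", "a2", "door", "b2", "goal"],
    ["a4", "door", "b1", "goal"],
]
bottlenecks = frequent_states(trajectories, k=1, exclude={"goal"})
```

The goal state is excluded because every successful trajectory ends there; without that exclusion it would tie with the doorway.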

Our intuitive notion of a bottleneck state in Algorithm 1 is close in spirit to several
other graph-theoretic definitions appearing in the literature, although we emphasize that
we have not designed our approach to HRL around any single characterization of
bottlenecks. Several graph clustering algorithms are proposed in (Menache et al., 2002; Mannor
et al., 2004) for identifying subgoals (online) to accelerate Q-Learning. The latter reference
employs a form of local spectral clustering (local because good bottlenecks may not always
be part of a global cut) and is related to the work of Osentoski and Mahadevan (2010), 
while the former proposes a clustering method that can take advantage of the current value 
function estimate. In these papers a weighted graph is periodically built from observed state 
transitions and then cut into clusters. Options are learned so that neighboring pieces of 
the graph can reach each other. Although policies on clusters (options) are computed sepa- 
rately, they are computed on the basis of an artificial reward, and can therefore be incorrect. 
The work of Simsek et al. (2005) is similar, and considers local spectral clustering on the 
basis of a limited, recent collection of trajectory samples. The statespace is successively 
explored and bottlenecks are identified without having to perform global computations. 
Options corresponding to the clusters resulting from graph cuts are again learned, but may 
also be incorrect, so re-learning of the options is prescribed. Another approach, distinct 
from spectral clustering, is the identification of bottleneck states based on "betweenness",
proposed by Simsek and Barto (2009). The idea follows from the observation that bottle- 
necks may not always be identifiable based on node connectivity. Bottlenecks are defined 
to be states through which a large fraction of graph geodesics must pass. States within a
small neighborhood with comparatively high betweenness are identified as bottlenecks, and 
options are defined on the basis of these subgoals. Betweenness is a natural alternative to 
diffusion based clustering techniques, but can give substantially different results depending 
on the geometry of the statespace and how the graph weights are chosen. Each of the clus- 
tering methods described may also be used within our framework to choose bottlenecks and 
partition the statespace, and in online exploration scenarios the current statespace graph 
may be re-estimated (and the MDP re-compressed) as desired. 
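
To make the betweenness idea concrete, the sketch below computes unnormalized shortest-path betweenness on a small unweighted graph using Brandes' algorithm and picks the highest-scoring state as a bottleneck candidate. The graph (two fully connected "rooms" joined through a single door state) and the function name `betweenness` are illustrative assumptions; the weighted, local variant used by Simsek and Barto (2009) is not reproduced here.

```python
from collections import deque

def betweenness(adj):
    """Unnormalized shortest-path betweenness for an unweighted,
    undirected graph given as {node: [neighbors]} (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for src in adj:
        # Breadth-first search from src, counting shortest paths.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[src], dist[src] = 1, 0
        queue = deque([src])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != src:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}

# Room A = {0, 1, 2}, room B = {4, 5, 6}, door state 3 joins states 2 and 4.
rooms = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4],
    4: [3, 5, 6], 5: [4, 6], 6: [4, 5],
}
scores = betweenness(rooms)
door = max(scores, key=scores.get)
```

In this toy graph every shortest path between the two rooms passes through state 3, so it receives the highest score even though its degree is lower than that of its neighbors.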

The third category of hierarchy discovery research may be represented by the intrinsic 
motivation work of Singh et al. (2005); Barto et al. (2004), and the skill discovery method 
of Konidaris and Barto (2009). A significant difference between these references and the
work described here is that, while coarse structure is used to decompose the fine scale 
problem into more manageable pieces, there is no explicit independent coarse problem that
is solved and pushed downwards in order to guide/accelerate the solution at the fine scale,
and only one level of abstraction is considered. In (Singh et al., 2005), an agent discovers
skills by experimenting within the domain and receiving rewards for actions which lead to
novel, salient, or intrinsically interesting outcomes. The learned skills then serve as options 
within an SMDP framework, to learn policies that can accomplish extrinsically rewarded 
tasks. The authors consider two layer hierarchies (one level of abstraction), and require 
manual specification as to which events are salient and how they are rewarded. This work 
differs from ours in that the sub-tasks we learn are tailored to a particular goal. In our 
development, nothing is required from the user and learning may be faster, but sub-task 
solutions may not as easily transfer to other problems. One possible way around this would 
be to solve multiple MMDPs corresponding to different objectives within the same domain, 
and store the cluster specific solutions in a database. 

The skill chaining method proposed by Konidaris and Barto (2009) seeks to learn skills
in continuous domains by working backwards from a goal state, and is particularly useful 
under a query paradigm in which multiple, closely related problems need to be solved (for 
instance, involving different goal states). The skill discovery process is local in the sense 
that only a neighborhood around the previous milestone is considered in defining a new 
skill. The notion of chaining together skills is similar to the constraint we have imposed 
stipulating that coarse actions can only be taken in bottleneck states: the initiation/terminal 
sets of a given skill can only be initiation and/or terminal sets of other skills. In a subsequent
paper (Konidaris et al., 2012a), the authors extend skill chains to trees of skills. However,
the trees refer to the arrangement of skills within a single level of abstraction, and do not
refer to a tree of scales. Skill chains and trees may be described as effectively imposing a 
localized representation for the value function, driven by the geometry of the problem. By
using localized basis functions to represent the value function in a flat model (Mahadevan
and Maggioni, 2007; Osentoski and Mahadevan, 2010), it is likely that similar structure
could be captured. On the basis of these observations, it is possible that the recursive
spectral clustering algorithm described in Section 3.1 can lead to a hierarchy of sub-tasks
respecting similar geometric properties as that of the skill trees in (Konidaris et al., 2012a) 
but at more than one level of abstraction. 

The references above propose different heuristics for defining how and when options 
(or other sub-goals) are learned and/or updated. However, a major challenge is to define 
an isolated problem whose solution can be obtained efficiently (i.e. locally), but is still 
consistent with other macro-actions. In several of the approaches above, the rewards used 
to learn the subtasks and the values fixed at bottleneck states may not be compatible, in 
which case policies can conflict (with respect to a designated goal) across subtasks. The 
multiscale framework described here provides a principled way to learn consistent local 
policies, given a partition of the statespace into subtasks. The reward function, boundary 
values, and discounts are determined automatically in our setting. Consistency is also 
maintained across scales so that the process may be readily repeated as necessary. If a 
single macro-action is itself a large problem, it is not clear how the methods above can be 
extended to another scale because of scale-dependent assumptions. The learning algorithms 
we have proposed are independent of the scale at which they are applied. Finally, it is often 
the case that non-standard algorithms are required to solve a given problem, once a set of 
subtasks is identified. Our approach provides a hierarchy of ordinary MDPs which can be
solved using standard techniques. 

A more integrated approach to HRL with automatic construction of the hierarchy is the
recent work of Barry et al. (2011). The DetH* algorithm proposed in (Barry et al., 2011)
shares some common themes with our work (and we like the name), although there are also 
pointed differences. DetH* uses a type of coarse policy (a deterministic map between coarse 
states) to decompose the fine scale problem into sub-problems (extending beyond two layers 
is not discussed), but does not determine a coarse value function that can be used to solve 
the independent sub-problems in a manner which maintains global consistency. A type of 
local optimality is, however, guaranteed. The end product at the coarse scale is not an 
MDP, and the complexity of the proposed coarse solution algorithm depends on a quantity 
similar to the worst case fine scale cluster diameter. The multiscale MDPs proposed here are 
hierarchies of independent, self-contained MDPs. DetH* clusters the statespace by trading 
off the size of the clusters against a reachability condition. Heuristics are applied to ensure 
that clusters do not become too large or too small. The heuristics indirectly attempt to 
settle on an appropriate scale, but it is not immediately clear how geometry enters the 
picture. "Too-large" appears to mainly be a computational consideration (i.e. the number 
of states). Our perspective is that it is perfectly fine, indeed desirable, for clusters to be large,
depending on the problem geometry or other notion of intrinsic complexity: for example,
when the Markov chain associated with a policy is fast mixing within a room. More precisely,
the number of states in a cluster is not a computational problem so much as conditioning is. 
The approach taken in our work seeks to find the right scale and partitioning based directly 
on local geometry, and by extension, conditioning. 

Another difference worth mentioning is that DetH* defines coarse states to be repre- 
sentatives of entire fine scale clusters, and goal states are also lumped into a single macro 
goal state. In our framework the coarse states are bottlenecks, and are elements of the fine 
scale statespace. Because the coarse states in DetH* are sets, the authors define a cost 
function on coarse states based on averages of shortest paths between clusters. The extent 
to which the underlying dynamics can be captured is not clear, and the coarse quantities are
not consistent with the fine scale in a precise, probabilistic sense invoking an underlying 
Markov chain. In addition, DetH* determines a hierarchy on the basis of shortest paths, 
and cannot consider a coarsening with respect to a particular policy. Transitions between 
coarse states are deterministic, whereas we construct a coarse problem which is itself an 
MDP, having its own transition kernel. This allows for greater generality and encompasses
a richer set of problems. Finally, Barry et al. (2011) and the other references above do not
consider transfer within their respective hierarchical frameworks.



6.2 Transfer Learning 



A good overview of the current landscape for transfer in reinforcement learning is (Taylor
and Stone, 2009); we provide only a brief summary of some efforts related to ours. The
literature discussed up to this point has been primarily concerned with discovering state 
abstractions for a specific problem, and transfer may be possible only to problems in the
same domain, if at all. The approach taken in (Konidaris et al., 2012b; Konidaris and Barto,
2007) posits the existence of shared features across related tasks, and discusses transfer of
value functions defined from those features (for example, the features may be coefficients in
a basis function expansion). The shared features constitute a representation which simulta- 
neously captures relatedness among tasks and solves any relevant correspondence problems. 
Transferred value functions serve as shaping functions for learning new tasks. Reward systems
are the same across tasks. In some instances, options are transferred (Konidaris and
Barto, 2007), again given a suitable feature space, but the option transfer is limited to a
single level of abstraction. The authors do not discuss how to identify a suitable feature
space ("agent space") to carry out transfer, although this is a crucial element and deter- 
mines what kind of transfer is possible, and the degree to which transfer helps in new tasks. 
The earlier work of Guestrin et al. (2003) is related to that of Konidaris et al. (2012b) in 
that sets of similar problems are defined from the ground-up using a common, class-based 
formalism. In the language of Konidaris et al. (2012b), Guestrin et al. (2003) specify prob- 
lems with a specific, predetermined feature space in mind, so that value functions defined on 
the features immediately transfer to tasks within the class, by construction. The approach 
is not, however, hierarchical. 



Ferguson and Mahadevan (2006) (also Tsao et al. (2008)) do not consider transfer in a
hierarchical setting, but take an approach that can also be incorporated into our development.
In their approach, eigenfunctions of a graph Laplacian describing the domain are transferred
(the graph is constructed from trajectories). These eigenfunctions (or "proto-value
functions") serve as basis functions for defining a value function over the statespace (Mahadevan
and Maggioni, 2007). Transfer is considered among tasks with identical domains but
varied reward functions, or among tasks with fixed reward function and geometry, but scaled
statespaces. Since subtasks (coarse or fine) within an MMDP are themselves MDPs, one 
may also select various basis functions on which to expand the local value functions specific 
to clusters in our framework. This extension of the basic MMDP solution methodology we 
have described (Section 3.3) can be applied at any scale, based on graph Laplacians derived
from either simulations, or from a transition model P. The resulting proto-value functions
may be stored as part of the solution to a sub-task in the library of transferable objects,
and transferred when appropriate. 



Multiscale MDPs, as defined in this paper, contrast with the approaches above in that 
transfer may be pursued at any scale (out of many), and the procedure for carrying out 
transfer is the same regardless of scale. We support transfer of coarse or fine scale knowl- 
edge, or combinations thereof. In addition, we have attempted to automatically handle the 
problem of systematically identifying transfer opportunities and encoding the knowledge to 
be transferred. Although value functions (defined with basis functions or otherwise) may 
be transferred within our framework, transfer can also take the form of policies or poten- 
tial functions, so that transfer can occur between more dissimilar tasks. We still, however, 
require statespace graph matchings, which may be challenging to obtain depending on the 
problems and scales under consideration. Finally, refinement and improvement of a trans- 
ferred quantity is straightforward when working with MMDPs. Any algorithm may be used 
to improve the policy, since problem and sub-task representations are always independent 
MDPs, both within and across scales. 
7. Discussion

We have presented a general framework for efficiently compressing Markov decision prob- 
lems, and considered multiscale knowledge transfer between related MDPs. Our treatment 
is multiscale, and centers on a hierarchical decomposition in which coarse scale problems 
are independent, deterministic MDPs, and in which local sub-problems at fine scales may 
be decoupled given a coarse scale solution. We then argued that such multiscale represen- 
tations may be used to efficiently solve a problem, and to transfer localized and/or coarse 
solutions rather than global solutions. The experiments we considered demonstrated com- 
putational speedups as well as the transfer of localized potential functions and policies at 
both coarse and fine scales across three example domains. As one would expect, problems 
receiving transferred information were shown to be solvable using less computational effort. 

In the subsections below, we address a few generalizations and suggest outstanding 
directions for future consideration. 



7.1 Model-based vs. Model-free Learning



The compression procedure introduced above assumes access to a pre-specified model
described by P, R and Γ, although it is only mildly model-dependent in the sense that
compression involves averaging so that the "model" is not needed to high precision. The assumption
that the model is known may be relaxed entirely, however. The general multiscale approach 
to compression we have described may be extended to include a completely model-free 
setting by considering a fully empirical, Monte-Carlo based compression and bottleneck 
detection regime. Bottlenecks may be initially detected on the basis of a local exploration
(see for example Spielman's local heat flow algorithm (Spielman and Teng, 2008) or Peres'
evolving sets (Andersen and Peres, 2009; Morris and Peres, 2003)), so that the entire P
matrix is not needed. The exploration may be done starting from a goal state, for example,
and would be inexpensive because only the cluster enclosing the goal state needs
to be considered. Given the bottlenecks, Monte-Carlo based simulations can be used to 
compress the MDP locally in the vicinity of the starting state by directly estimating the 
coarse ingredients (transition probabilities, rewards, and discounts). The process may then 
be repeated starting from the detected bottleneck states, proceeding outwards, to build up 
a global picture successively adding one (or a few) clusters at a time. On-policy exploration 
could be accelerated in previously explored regions by using the compressed model as a fast 
simulator. This approach could make difficult problems, where long sequences of actions 
are necessary to reach the goal, more tractable. 
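
The Monte-Carlo compression step outlined above can be sketched as follows, under stated assumptions: `step` is a hypothetical simulator that already follows the cluster policy and returns the next state, the reward, and the discount of each transition; `mc_compress` and the four-state random walk are illustrative and are not part of the procedure as specified in this paper. Rollouts start at one bottleneck and stop at the next bottleneck reached, and the coarse transition probabilities, rewards, and discounts are estimated by averaging.

```python
import random

def mc_compress(step, start, bottlenecks, n_rollouts=2000, seed=0):
    """Estimate coarse quantities for one coarse action from rollouts.

    step(state, rng) -> (next_state, reward, discount) is a hypothetical
    simulator that follows the cluster policy. Each rollout starts at
    `start` and stops at the first bottleneck reached after one step.
    Returns {s': (probability, mean discounted reward, mean discount)}.
    """
    rng = random.Random(seed)
    totals = {}
    for _ in range(n_rollouts):
        state, reward_sum, discount = start, 0.0, 1.0
        while True:
            state, r, g = step(state, rng)
            reward_sum += discount * r  # weight by discounts of earlier steps
            discount *= g
            if state in bottlenecks:
                break
        n, rs, ds = totals.get(state, (0, 0.0, 0.0))
        totals[state] = (n + 1, rs + reward_sum, ds + discount)
    return {s: (n / n_rollouts, rs / n, ds / n)
            for s, (n, rs, ds) in totals.items()}

def step(state, rng):
    """Toy chain on states 0..3: 0 moves to 1, 3 is absorbing, and the
    interior states 1 and 2 move left or right with equal probability."""
    if state == 0:
        nxt = 1
    elif state == 3:
        nxt = 3
    else:
        nxt = state + rng.choice((-1, 1))
    return nxt, -1.0, 0.9

estimates = mc_compress(step, start=0, bottlenecks={0, 3})
```

For this walk the exact probability of returning to state 0 before reaching state 3 is 2/3, which gives a simple check on the estimate.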



7.2 Continuous Domains, Sampling, and Dictionary Expansions 

We have assumed discrete state and action spaces, although it should be emphasized that 
the multiscale development here does not critically depend on the discrete assumption, and 
may be adapted to continuous domains. A simple approach, which was considered in
Section 5.2, is to discretize and then build a model on a discrete set of representative states.
The discretization is in general problem dependent, and need not be dense in order to
apply the homogenization prescribed above. A continuous problem could be discretized with
a coarse sampling determined by the problem's complexity, with the expectation of good 
results since complexity (and geometry) largely determines the multiscale decomposition.
The translation from continuous to discrete could be model-based (e.g., eigenfunctions of
P) or model-free (e.g., eigenvectors of the graph Laplacian built from simulated
trajectories). Moreover, in this case the coarse MDPs have discrete statespaces, so that handling
continuous variables is only a concern at the finest scale. 

Another approach might be to overlay a discrete coarse MDP on top of a continuous 
fine scale problem. In this case, the fine scale quantities (including policies) could be repre- 
sented by expansions on basis function dictionaries for the statespace, or even with factored 
representations and neural networks. One could then consider collapsing bottleneck regions 
(as sets) into single coarse states, and compressing based on fine scale trajectory statistics 
between these regions. Depending on the form of the model, it is possible that analytical 
expressions for the coarse quantities can be obtained (for example, if one uses Gaussian basis
functions to describe P, R, leading to Gaussian integrals). The local boundary value
problems occurring at the fine scale could be amenable to solution with existing approximate
DP algorithms. 



We point out that adopting basis function representations for the model could also 
support a broader set of transfer possibilities. Basis functions may themselves be transferred 
locally (representation transfer), in addition to solutions. A careful choice of basis can also 
impart invariance properties, and provide a means to accommodate domain changes (e.g., 
scaling, by way of Nyström extensions, and goal changes) and reward function changes across
problems (see Ferguson and Mahadevan (2006) for a discussion related to representation
transfer).



7.3 Partially Observable Domains 



Partially observable MDPs (POMDPs) can be cast as fully observable MDPs on a continuous 
belief statespace, so that in theory POMDPs can be decomposed, solved and transferred 
using the framework discussed in this paper. In practice, solving belief MDPs exactly 
can be computationally prohibitive. Extending the multiscale MDP framework described 
above to POMDPs in a more fundamental way could lead to efficient approximate solution 
algorithms. For instance, solutions to more tractable coarse problems could be used to 
provide interpolated solutions to finer problems, where accuracy vs. complexity of the 
interpolation can be balanced locally. 
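Appendix A. Derivation of the linear system describing a value function

Let $\pi$ be a policy on $S$. Here we prove that the value function $V^\pi$ satisfies

$$V^\pi(s) = \sum_{s',a} P(s,a,s')\,\pi(s,a)\,\bigl[R(s,a,s') + \Gamma(s,a,s')\,V^\pi(s')\bigr], \qquad s \in S,$$

where we recall that $V^\pi$ is the value function

$$V^\pi(s) = \mathbb{E}\Bigl[R(s_0,a_1,s_1) + \sum_{t=1}^{\infty}\Bigl\{\prod_{\tau=0}^{t-1}\Gamma(s_\tau,a_{\tau+1},s_{\tau+1})\Bigr\}R(s_t,a_{t+1},s_{t+1}) \,\Big|\, s_0 = s\Bigr].$$

Applying the Markov property to the first expectation on the right-hand side gives

$$\mathbb{E}[R(s_0,a_1,s_1) \mid s_0 = s] = \sum_{s',a}\mathbb{P}(s_1 = s', a_1 = a \mid s_0 = s)\,R(s,a,s') = \sum_{s',a}P(s,a,s')\,\pi(s,a)\,R(s,a,s').$$

For the second term, we have

$$\begin{aligned}
\mathbb{E}\Bigl[\sum_{t=1}^{\infty}\Bigl\{\prod_{\tau=0}^{t-1}\Gamma(s_\tau,a_{\tau+1},s_{\tau+1})\Bigr\}R(s_t,a_{t+1},s_{t+1}) \,\Big|\, s_0 = s\Bigr]
&= \mathbb{E}_{s_1,a_1}\Bigl[\Gamma(s_0,a_1,s_1)\,\mathbb{E}\Bigl[R(s_1,a_2,s_2) + \sum_{t=2}^{\infty}\Bigl\{\prod_{\tau=1}^{t-1}\Gamma(s_\tau,a_{\tau+1},s_{\tau+1})\Bigr\}R(s_t,a_{t+1},s_{t+1}) \,\Big|\, s_0 = s, s_1, a_1\Bigr] \,\Big|\, s_0 = s\Bigr] \\
&= \mathbb{E}_{s_1,a_1}\bigl[\Gamma(s_0,a_1,s_1)\,V^\pi(s_1) \mid s_0 = s\bigr] \\
&= \sum_{s',a}P(s,a,s')\,\pi(s,a)\,\Gamma(s,a,s')\,V^\pi(s').
\end{aligned}$$

Putting these results together, we obtain the linear system of equations

$$V^\pi(s) = \sum_{s',a}P(s,a,s')\,\pi(s,a)\,\bigl[R(s,a,s') + \Gamma(s,a,s')\,V^\pi(s')\bigr], \qquad s \in S,$$

which is the equation to be proved.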
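
As a small numerical check of the linear system derived in this appendix, the sketch below evaluates a policy by solving $(I - M)V = r$, where $M(s,s') = \sum_a P(s,a,s')\pi(s,a)\Gamma(s,a,s')$ and $r(s) = \sum_{s',a} P(s,a,s')\pi(s,a)R(s,a,s')$. The function names, the nested-list layout `P[s][a][t]`, and the two-state example are assumptions made for illustration; a dense Gauss-Jordan solver is used only to keep the example free of external dependencies.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def evaluate_policy(P, R, G, pi):
    """Value function of the policy pi.

    P[s][a][t], R[s][a][t], G[s][a][t] hold the transition probability,
    reward and discount for moving from s to t under action a, and
    pi[s][a] holds the action probabilities in state s.
    """
    n = len(P)
    actions = range(len(pi[0]))
    A = [[(1.0 if s == t else 0.0)
          - sum(P[s][a][t] * pi[s][a] * G[s][a][t] for a in actions)
          for t in range(n)] for s in range(n)]
    r = [sum(P[s][a][t] * pi[s][a] * R[s][a][t]
             for a in actions for t in range(n)) for s in range(n)]
    return solve(A, r)

# Two states, one action. State 0 stays put or moves to the absorbing
# state 1 with equal probability, earning reward 1 per transition.
P = [[[0.5, 0.5]], [[0.0, 1.0]]]
R = [[[1.0, 1.0]], [[0.0, 0.0]]]
G = [[[0.5, 0.5]], [[0.5, 0.5]]]
pi = [[1.0], [1.0]]
V = evaluate_policy(P, R, G, pi)
```

Here the system reduces to $V(0) = 1 + 0.25\,V(0)$ and $V(1) = 0.5\,V(1)$, so $V(0) = 4/3$ and $V(1) = 0$.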

Appendix B. Analytical results and computational considerations for the 
compression step 

B.1 Compressed transition matrix $\tilde P$: Proof of Proposition 1

Let $\tilde a \in \tilde A$ be the coarse action corresponding to executing a policy $\pi_c \in \boldsymbol{\pi}_c$ in cluster $c$,
so that $(X_t)_{t \ge 0}$ is the (discrete time) Markov chain on the cluster $c$ with transition matrix
$P_c^{\pi_c}$. Recall that the set of bottleneck states within the cluster is denoted $\partial c \subseteq c$ and the
set of non-bottleneck (interior) states in the cluster is denoted $\mathring{c} := c \setminus \partial c$.

First, observe that if $s \notin c$, then the entries $\tilde P(s, \tilde a, \cdot)$ are not defined because the action
is unavailable. Second, if $s' \notin c$, then we know that $\tilde P(s, \tilde a, s') = 0$. Therefore we restrict
our attention to pairs $s, s' \in c$, i.e. $s, s'$ compatible with $\tilde a$. The transition probabilities
among pairs of states in $\partial c$ are computed by observing the Markov chain $(X_t)_{t \ge 0}$ at the
hitting times of $\partial c$:

$$T_m = \inf\{t > T_{m-1} \mid X_t \in \partial c\}, \qquad m = 1, 2, \ldots$$

with $T_0 = \inf\{t \ge 0 \mid X_t \in \partial c\}$. The hitting times are a.s. finite ($\mathbb{P}(T_m < \infty) = 1$,
$\forall m \ge 0$) in light of the fact that, by construction, absorbing states are bottlenecks, and
the assumption that $\partial c$ is $\pi$-reachable from any starting point. A new chain $(Y_m)_{m \ge 0}$
taking only values in $\partial c$ can now be defined as $Y_m = X_{T_m}$. The transition probability
matrix governing $(Y_m)_m$ is computed from that of $(X_t)_t$ by solving a linear system for a
few different right hand sides as follows. Let $\mathbb{P}_s(B) := \mathbb{P}(B \mid X_0 = s)$, for any event $B$
(measurable w.r.t. a suitable $\sigma$-algebra). Consider the hitting probabilities $\mathbb{P}_s(X_{T_0} = s')$.
Clearly $\mathbb{P}_s(X_{T_0} = s') = \delta_{s,s'}$ for $s, s' \in \partial c$, where $\delta$ denotes the Kronecker delta function.



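The strong Markov property (see for instance Norris (1997, Thm 1.4.2)) allows one to apply
the Markov property at (finite) stopping times, so that a one-step analysis gives the hitting
probabilities for $s \in \mathring{c}$, $s' \in \partial c$ as

$$\begin{aligned}
\mathbb{P}_s(X_{T_0} = s') &= \mathbb{E}\bigl[\mathbb{P}_s(X_{T_0} = s' \mid X_1, a_1) \mid X_0 = s\bigr] \\
&= \sum_{s'' \in c,\, a \in A} P_c(s,a,s'')\,\pi_c(s,a)\,\mathbb{P}_s(X_{T_0} = s' \mid X_1 = s'', a_1 = a) \\
&= \sum_{a \in A} P_c(s,a,s')\,\pi_c(s,a) + \sum_{s'' \in \mathring{c},\, a \in A} P_c(s,a,s'')\,\pi_c(s,a)\,\mathbb{P}_{s''}(X_{T_0} = s').
\end{aligned}$$

The third equality follows from the second applying the fact that $X_{T_0}$ is independent of
$a_1$ given $X_1$, and the strong Markov property. Summarizing, these probabilities may be
computed by solving the linear system

$$\mathbb{P}_s(X_{T_0} = s') = \begin{cases} P_c^{\pi_c}(s,s') + \sum_{s'' \in \mathring{c}} P_c^{\pi_c}(s,s'')\,\mathbb{P}_{s''}(X_{T_0} = s') & \text{if } s \in \mathring{c}, \\ \delta_{s,s'} & \text{if } s \in \partial c. \end{cases} \tag{23}$$

By the strong Markov property, we also have for $s, s' \in \partial c$,

$$\begin{aligned}
\tilde P(s, \tilde a, s') &= \mathbb{P}(Y_{m+1} = s' \mid Y_m = s) \\
&= \mathbb{P}(X_{T_{m+1}} = s' \mid X_{T_m} = s) \\
&= \mathbb{P}(X_{T_1} = s' \mid X_{T_0} = s) \\
&= \mathbb{P}(X_{T_1} = s' \mid X_0 = s) = \mathbb{P}_s(X_{T_1} = s').
\end{aligned}$$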






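The law of total probability applied to the right-hand side of the third equality gives, for
$s, s' \in \partial c$,

$$\begin{aligned}
\mathbb{P}(X_{T_1} = s' \mid X_{T_0} = s) &= \mathbb{E}\bigl[\mathbb{P}(X_{T_1} = s' \mid X_{T_0} = s, X_{T_0+1}, a_{T_0+1}) \mid X_{T_0} = s\bigr] \\
&= \sum_{s'' \in c,\, a \in A} P_c(s,a,s'')\,\pi_c(s,a)\,\mathbb{P}(X_{T_1} = s' \mid X_{T_0} = s, X_{T_0+1} = s'', a_{T_0+1} = a) \\
&= \sum_{s'' \in c} P_c^{\pi_c}(s,s'')\,\mathbb{P}(X_{T_1} = s' \mid X_{T_0} = s, X_{T_0+1} = s'') \\
&= P_c^{\pi_c}(s,s') + \sum_{s'' \in \mathring{c}} P_c^{\pi_c}(s,s'')\,\mathbb{P}_{s''}(X_{T_0} = s'). 
\end{aligned} \tag{24}$$

The third equality follows from the second using the fact that $X_{T_1}$ is independent of $a_{T_0+1}$
given $X_{T_0+1}$.

Noticing that $(\mathbb{P}(X_{T_1} = s' \mid X_{T_0} = s))_{s,s' \in \partial c}$ depends on $(\mathbb{P}_s(X_{T_0} = s'))_{s \in \mathring{c},\, s' \in \partial c}$, but the
latter do not depend on the former, we can combine Equations (23) and (24) into a single
linear system for each $s' \in \partial c$:

$$H_{s,s'} = P_c^{\pi_c}(s,s') + \sum_{s'' \in \mathring{c}} P_c^{\pi_c}(s,s'')\,H_{s'',s'}, \qquad s \in c,\ s' \in \partial c. \tag{25}$$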

We then have

$$\tilde P(s, \tilde a, s') = H_{s,s'}, \qquad \text{for all } s, s' \in \partial c,$$

assuming $H$ is the minimal non-negative solution to (25), and $\tilde a$ is the action corresponding
to executing the policy $\pi_c$ in cluster $c$. Consider the partitioning

$$P_c^{\pi_c} = \begin{pmatrix} Q & B \\ C & D \end{pmatrix}, \qquad H = \begin{pmatrix} h_q \\ h_b \end{pmatrix},$$

where the blocks $Q$, $D$ describe the interaction among non-bottleneck and bottleneck states
within cluster $c$ respectively. In matrix-vector form, we can solve for the compressed
probabilities by computing the minimal non-negative solution to

$$(I - Q)\,h_q = B \tag{26}$$

followed by

$$h_b = D + C\,h_q,$$

where $h_b$ is the desired transition probability matrix of the compressed MDP given the
action $\tilde a$. If Equation (26) has a unique solution, then the cost of these computations is
at most $O(|\mathring{c}|^3 + |\partial c||\mathring{c}|^2)$.$^{22}$ If solving the linear system (26) does not produce a non-negative
solution, then algorithms for non-negative least-squares must be used.

From these expressions, it is clear that the transition probabilities starting from
non-bottleneck states $h_q$ do not depend on those starting from the bottleneck states or on entries
of $P_c^{\pi_c}$ outside of the blocks $Q$ and $B$. In addition, by definition of the stopping times above, the transition
probabilities enforce $\tilde P(s, \tilde a, s') = \delta_{s,s'}$ whenever $s$ is absorbing.
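
The two-stage computation in Equation (26) and the expression for $h_b$ can be sketched in a few lines. The version below is an illustration under stated assumptions, not the solver used in the experiments: instead of a factorization it iterates $h \leftarrow B + Qh$ from $h = 0$, which increases monotonically to the minimal non-negative solution of (26), and it assumes the cluster has at least one interior state. The function name and the four-state walk (bottlenecks 0 and 3, interior states 1 and 2) are hypothetical.

```python
def compress_transitions(Q, B, C, D, tol=1e-12, max_iter=100000):
    """Return (h_b, h_q) with h_b = D + C h_q and (I - Q) h_q = B.

    Q, B, C, D are the blocks of the cluster chain's transition matrix
    (interior to interior, interior to bottleneck, bottleneck to interior,
    bottleneck to bottleneck), each given as a list of rows.
    """
    def matmul(X, Y):
        return [[sum(x * Y[k][j] for k, x in enumerate(row))
                 for j in range(len(Y[0]))] for row in X]

    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    # Fixed-point iteration started at zero: h_n = sum_{k<n} Q^k B.
    h_q = [[0.0] * len(B[0]) for _ in B]
    for _ in range(max_iter):
        new = add(B, matmul(Q, h_q))
        change = max(abs(x - y) for rn, ro in zip(new, h_q)
                     for x, y in zip(rn, ro))
        h_q = new
        if change < tol:
            break
    return add(D, matmul(C, h_q)), h_q

# Walk on states 0..3: state 0 moves to 1, state 3 is absorbing, and the
# interior states 1 and 2 step left or right with probability 1/2.
Q = [[0.0, 0.5], [0.5, 0.0]]   # rows: states 1, 2; columns: states 1, 2
B = [[0.5, 0.0], [0.0, 0.5]]   # rows: states 1, 2; columns: states 0, 3
C = [[1.0, 0.0], [0.0, 0.0]]   # rows: states 0, 3; columns: states 1, 2
D = [[0.0, 0.0], [0.0, 1.0]]   # rows: states 0, 3; columns: states 0, 3
h_b, h_q = compress_transitions(Q, B, C, D)
```

For this chain the interior hitting probabilities are those of a gambler's ruin problem, and the compressed matrix has rows (2/3, 1/3) and (0, 1), with the absorbing bottleneck mapped to itself.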



22. Using, for instance, an LU factorization ($O(|\mathring{c}|^3)$) to efficiently solve for $|\partial c| \ll |\mathring{c}|$ right-hand sides at a
cost of $O(|\mathring{c}|^2)$ each.
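B.2 Compressed rewards $\tilde R$: Proof of Proposition 3

We will first need to define a controlled Markov process conditioned on future events. The
approach taken here is similar to that of the Doob $h$-transform (see Levin et al. (2008)) for
Markov chains, but differs in that we keep track of the actions. We fix $s' \in \partial c$. Consider
the event $\{X_{T_0} = s'\}$ and define

$$h_{s'}(s) := \mathbb{P}_s(X_{T_0} = s'), \qquad s \in c, \tag{27}$$

with the probabilities $\mathbb{P}_s(X_{T_0} = s')$ given by Equation (23). It can be shown that the
function $h_{s'}$ is $P_c^{\pi_c}$-harmonic. Using Bayes rule and the strong Markov property,

$$\begin{aligned}
P_{h_{s'}}(s,a,s'') := \mathbb{P}_s(X_1 = s'', a_1 = a \mid X_{T_0} = s')
&= \frac{\mathbb{P}_s(X_{T_0} = s' \mid X_1 = s'', a_1 = a)\,\mathbb{P}_s(X_1 = s'', a_1 = a)}{\mathbb{P}_s(X_{T_0} = s')} \\
&= \frac{\mathbb{P}_{s''}(X_{T_0} = s')\,P_c(s,a,s'')\,\pi_c(s,a)}{\mathbb{P}_s(X_{T_0} = s')} \\
&= \frac{P_c(s,a,s'')\,\pi_c(s,a)\,h_{s'}(s'')}{h_{s'}(s)}
\end{aligned} \tag{28}$$

for $s'' \in c_{s'} := \{s \in c \mid h_{s'}(s) > 0\}$, $s \in c_{s'} \setminus \{s'\}$, and $a \in A$.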
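Similarly, for $(s, s') \in \operatorname{supp}_{\tilde a}(\tilde P) \subseteq \partial c \times \partial c$, $s'' \in c_{s'}$, $a \in A$,

$$\begin{aligned}
\mathbb{P}(X_{T_0+1} = s'', a_{T_0+1} = a \mid X_{T_1} = s', X_{T_0} = s)
&= \frac{\mathbb{P}(X_{T_1} = s' \mid X_{T_0} = s, X_{T_0+1} = s'', a_{T_0+1} = a)\,P_c(s,a,s'')\,\pi_c(s,a)}{\mathbb{P}(X_{T_1} = s' \mid X_{T_0} = s)} \\
&= \frac{P_c(s,a,s'')\,\pi_c(s,a)\,h_{s'}(s'')}{\tilde P(s,\tilde a,s')} \\
&=: P_{h_{s'}}(s,a,s''),
\end{aligned} \tag{29}$$

since

$$\mathbb{P}(X_{T_1} = s' \mid X_{T_0} = s, X_{T_0+1} = s'', a_{T_0+1} = a) = \begin{cases} \delta_{s'',s'} & \text{if } s'' \in \partial c, \\ \mathbb{P}_{s''}(X_{T_0} = s') & \text{if } s'' \in \mathring{c} \end{cases}$$

is equal to $h_{s'}(s'')$ defined by Equation (27), and for $s \in \partial c$ we have $\mathbb{P}_s(X_{T_1} = s') = \tilde P(s, \tilde a, s')$
as given by Equation (24).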
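
The conditioning in Equations (28) and (29) can be illustrated on a chain with the actions already averaged out, which is a simplification of the construction above (the paper keeps track of actions). Under that assumption, the conditioned transition probabilities are $P(s,s'')\,h(s'')/h(s)$, and harmonicity of $h$ at interior states makes the conditioned rows sum to one. The function name `h_transform` and the four-state walk are hypothetical.

```python
def h_transform(P, h):
    """Condition a chain with transition matrix P (a list of rows) on an
    event whose probability from state s is h[s]:
    P_h[s][t] = P[s][t] * h[t] / h[s].
    Rows with h[s] == 0 are returned as None (conditioning is undefined)."""
    n = len(h)
    return [None if h[s] == 0 else
            [P[s][t] * h[t] / h[s] for t in range(n)]
            for s in range(n)]

# Walk on states 0..3 with bottlenecks 0 and 3; h is the probability of
# reaching bottleneck 3 before bottleneck 0 (gambler's ruin).
P = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]
h = [0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0]
P_h = h_transform(P, h)
```

Conditioned on ending at state 3, the walk can no longer step from state 1 to state 0, and from state 2 it moves right with probability 3/4 rather than 1/2.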

We now consider the expected rewards collected along paths between bottlenecks
connected to a cluster. The process is similar to that of the transition probabilities, where we
first defined hitting probabilities at time $T_0$, and from those quantities defined conditional
hitting probabilities at time $T_1$. Here we use discounted expected rewards collected up to
time $T_0$ to ultimately compute rewards collected between $T_0$ and $T_1$. Recall that we assume
a reward is collected only after transitioning. Let $T$ and $T'$ be two arbitrary stopping times
satisfying $0 \le T < T' < \infty$ (a.s.). The discounted reward accumulated over the interval
$T \le t \le T'$ is given by the random variable

$$R_T^{T'} := R(X_T, a_{T+1}, X_{T+1}) + \sum_{t=T+1}^{T'-1}\Bigl[\prod_{\tau=T}^{t-1}\Gamma(X_\tau, a_{\tau+1}, X_{\tau+1})\Bigr]R(X_t, a_{t+1}, X_{t+1}),$$

where $a_{t+1} \sim \pi_c(X_t)$ for $t = T, \ldots, T'-1$, and we set $R_T^T = 0$ for any $T$.
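
For a single realized trajectory, the discounted reward over an interval can be accumulated with a running product of discounts, as in the sketch below. The indexing convention (`actions[t]` holds the action taken at `states[t]`), the callables `R` and `G`, and the function name are assumptions made for this illustration.

```python
def discounted_reward(states, actions, R, G, T, T_end):
    """Discounted reward collected over steps T, ..., T_end - 1.

    states[t] is X_t and actions[t] is the action taken at X_t, so step t
    contributes R(X_t, a, X_{t+1}) weighted by the product of the discounts
    G of steps T, ..., t - 1. Returns 0 when T_end == T.
    """
    total, weight = 0.0, 1.0
    for t in range(T, T_end):
        x, a, y = states[t], actions[t], states[t + 1]
        total += weight * R(x, a, y)
        weight *= G(x, a, y)
    return total

# Three transitions with unit reward and a constant discount of 1/2.
states = [0, 1, 2, 3]
actions = ["go", "go", "go"]
unit_reward = lambda x, a, y: 1.0
half_discount = lambda x, a, y: 0.5
full = discounted_reward(states, actions, unit_reward, half_discount, 0, 3)
empty = discounted_reward(states, actions, unit_reward, half_discount, 1, 1)
```

With these inputs the first reward is undiscounted and the later ones are weighted by 1/2 and 1/4, so the total over the full interval is 1.75 and the empty interval contributes zero.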

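Consider $\mathbb{E}_s[R_0^{T_0} \mid X_{T_0} = s']$ for some fixed $s' \in \partial c$. We immediately have that
$\mathbb{E}_s[R_0^{T_0} \mid X_{T_0} = s'] = 0$ if $s = s'$, and is undefined if $s \notin c_{s'} := \{s \mid h_{s'}(s) > 0\}$ (note
that $(\partial c \setminus \{s'\}) \subseteq (c \setminus c_{s'})$ from (23)). We will need the following Lemma.

Lemma 6 For $s \in \mathring{c} \cap c_{s'}$, $s' \in \partial c$, $s'' \in c_{s'}$, $a \in A$,

$$\mathbb{E}\bigl[R_1^{T_0} \mid X_{T_0} = s', X_1 = s'', T_0 \ge 1\bigr] = \mathbb{E}_{s''}\bigl[R_0^{T_0} \mid X_{T_0} = s'\bigr]$$

and therefore,

$$\mathbb{E}_s\bigl[R_0^{T_0} \mid X_{T_0} = s', X_1 = s'', a_1 = a\bigr] = R(s,a,s'') + \Gamma(s,a,s'')\,\mathbb{E}_{s''}\bigl[R_0^{T_0} \mid X_{T_0} = s'\bigr].$$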
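Proof of Lemma 6. We have

$$\begin{aligned}
\mathbb{E}_s\bigl[R_0^{T_0} \mid X_{T_0} = s', X_1 = s'', a_1 = a\bigr]
&= R(s,a,s'') + \Gamma(s,a,s'')\,\mathbb{E}_s\Bigl[R(X_1,a_2,X_2) + \sum_{t=2}^{T_0-1}\Bigl\{\prod_{\tau=1}^{t-1}\Gamma(X_\tau,a_{\tau+1},X_{\tau+1})\Bigr\}R(X_t,a_{t+1},X_{t+1}) \,\Big|\, X_{T_0} = s', X_1 = s'', a_1 = a\Bigr] \\
&= R(s,a,s'') + \Gamma(s,a,s'')\,\mathbb{E}\bigl[R_1^{T_0} \mid X_{T_0} = s', X_1 = s'', T_0 \ge 1\bigr].
\end{aligned}$$

If $s'' = s'$ then there is nothing more to show, since $R_1^{1} = 0$. For $s'' \in \mathring{c} \cap c_{s'}$, it will suffice to
show that $\mathbb{E}[R_1^{T_0} \mid X_1 = s'', T_0 \ge 1] = \mathbb{E}_{s''}[R_0^{T_0}]$. Given a sequence of states $(i_n, \ldots, i_{n+p-m})$
and actions $(a_{n+1}, \ldots, a_{n+p-m})$ with $0 \le n, m \le p$, define the event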



/p—m 



p—m 



BP 



\j=o 



and consider the conditional probabihty, for n > 1, 



P({ro = n} n B{°^ I Xi = s". To > 1) = P(S/° I To = n, Xi = s")P(r = n | Xi = s" , Tq > 1) 

= F{Bl^ I Xi = s", X2 ^ 9c, . . . , Xn-i i dc, Xn e dc) 

X P(ro = n I Xi = s", To > 1) 
= P(Sj° I To = n - 1, Xo = s")P(7o = n - 1 I Xo = s"^ 
= P({ro = n - 1} n I Xo = s"), 

where we have used the fact that P(ro = n | Xi = s", Tq > 1) = P(ro = n - 1 | Xq = s"). 
This latter equality is true, since, by homogeneity. 



Xn 



(n-1 
f] {Xj ^ dc} n {Xn G dc} 
j=m 

(n—m—1 
fl {X,- ^ ac} n {X„_^ e ac} 
i=o 

= P(ro = n - m I Xo = s") 
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for n > m. Next, let 

/(zo, . . . , in, tti, . . . , a^) := i?(zo, ^i, ii) + ^ 



n-l 



t=l 



t-1 



.r=0 



Then, assuming Tq < oo a.s., we have 
E[i?fo I Xi = s'\To > 1] 



l<n<cx) ii,...,in£c 



= J2 J2 n{To = n-l}nB^°,\ Xo = s")fih,...,in,a2,...,an) 

l<n<oo ii,...,inGc 

= n{To^n}nBl°,\ Xo^s")f{io,...,in,ai,...,an) 

0<n<oo io,...,in^c 
ai,...,an^A 

= E[rI' I Xo = s'] . 



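With the above definitions, we turn to proving the proposition.

Proof of Proposition 3. With the same choice of $s'$ used in $E_s[R_0^{T_0} \mid X_{T_0} = s']$, define $h_{s'}$ as in Equation (27) and let, as above, $c_{s'} := \{s \in c \mid h_{s'}(s) > 0\}$. For $s \in \mathring{c} \cap c_{s'} = c_{s'} \setminus \{s'\}$, a one-step analysis gives

$$\begin{aligned}
E_s[R_0^{T_0} \mid X_{T_0} = s'] &= E_s\big\{ E_s[R_0^{T_0} \mid X_{T_0} = s', X_1, a_1] \,\big|\, X_{T_0} = s' \big\} \\
&= \sum_{s'', a} P_s(X_1 = s'', a_1 = a \mid X_{T_0} = s')\, E_s[R_0^{T_0} \mid X_{T_0} = s', X_1 = s'', a_1 = a] \\
&= \sum_{s'' \in c_{s'},\, a \in A} P_{h_{s'}}(s, a, s'')\, E_s[R_0^{T_0} \mid X_{T_0} = s', X_1 = s'', a_1 = a] \\
&= \sum_{s'' \in c_{s'},\, a \in A} P_{h_{s'}}(s, a, s'') \big( R(s, a, s'') + \Gamma(s, a, s'')\, E_{s''}[R_0^{T_0} \mid X_{T_0} = s'] \big) \\
&= \sum_{s'' \in c_{s'},\, a \in A} P_{h_{s'}}(s, a, s'')\, R(s, a, s'') + \sum_{s'' \in \mathring{c} \cap c_{s'},\, a \in A} P_{h_{s'}}(s, a, s'')\, \Gamma(s, a, s'')\, E_{s''}[R_0^{T_0} \mid X_{T_0} = s'],
\end{aligned}$$

where the third equality follows from the second using Equation (28) and the fact that $P_{h_{s'}}(s, a, s'') > 0$ only for $s'' \in c_{s'}$, and the fourth follows from the third applying Lemma 6. The last equality follows after rearranging terms, since $E_{s'}[R_0^{T_0} \mid X_{T_0} = s'] = 0$.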

With these expectations in hand, we can compute the discounted rewards between bottlenecks. Note that $E[R_{T_0}^{T_1} \mid X_{T_0} = s, X_{T_1} = s'] = E_s[R_0^{T_1} \mid X_{T_1} = s']$, for $s, s' \in \partial c$. By convention, we set $E_s[R_0^{T_1} \mid X_{T_1} = s'] = 0$ if $\tilde{P}(s, a, s') = 0$. For $s \in \partial c$ such that $(s, s') \in \operatorname{supp}(\tilde{P})$, we have $T_0 = 0$, and a one-step analysis similar to the above gives

$$E_s[R_0^{T_1} \mid X_{T_1} = s'] = E_s\big\{ E_s[R_0^{T_1} \mid X_{T_1} = s', X_{T_0+1}, a_{T_0+1}] \,\big|\, X_{T_1} = s' \big\},$$

where we have used the fact that

$$E_s[R_0^{T_1} \mid X_{T_1} = s', X_{T_0+1} = s'', a_{T_0+1} = a] = \begin{cases} R(s, a, s'') + \Gamma(s, a, s'')\, E_{s''}[R_0^{T_0} \mid X_{T_0} = s'] & \text{if } s'' \in \mathring{c} \\ R(s, a, s') & \text{if } s'' = s'. \end{cases}$$



As mentioned in Section 3.2.3, the boundary is reachable from any state by assumption, so $P_s(T_m < \infty) = 1$ for all $s \in c$, $m < \infty$. Hence, the solution to the linear system above is unique and bounded (Norris, 1997, Thm. 4.2.3).
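The interior linear system for the conditional expected rewards can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' code: a symmetric random walk on states 0..4 with bottlenecks {0, 4}, a single action, unit rewards, and no discounting. The hitting probabilities $h_{s'}(s) = s/4$ for $s' = 4$ are used in closed form for this toy case. The function name `conditional_rewards` is hypothetical.

```python
import numpy as np

def conditional_rewards(P_h, R, Gamma, interior):
    """Solve H(s) = sum_{a,k} P_h[s,a,k] * (R[s,a,k] + Gamma[s,a,k] * H(k)),
    with H = 0 at the destination bottleneck, for the interior states."""
    idx = np.asarray(interior)
    # One-step reward term over every successor k, including the bottleneck.
    b = np.einsum('sak,sak->s', P_h[idx], R[idx])
    # Discounted coupling among interior states only (H vanishes at the bottleneck).
    M = np.einsum('sak,sak->sk', P_h[idx], Gamma[idx])[:, idx]
    return np.linalg.solve(np.eye(len(idx)) - M, b)

# Toy cluster: symmetric walk on 0..4, destination bottleneck s' = 4.
n, interior = 5, [1, 2, 3]
P = np.zeros((n, 1, n))
for s in interior:
    P[s, 0, s - 1] = P[s, 0, s + 1] = 0.5
h = np.arange(n) / (n - 1)                     # h_{s'}(s) = s / 4 for this walk
P_h = np.zeros_like(P)
P_h[interior] = P[interior] * h[None, None, :] / h[interior][:, None, None]

R = np.ones((n, 1, n))       # unit reward per transition
Gamma = np.ones((n, 1, n))   # no discounting, so H is an expected exit time
H = conditional_rewards(P_h, R, Gamma, interior)
```

With unit rewards and no discounting, `H` is the expected number of steps to reach state 4 given that the walk exits there, which for this toy chain is 5, 4, and 7/3 from states 1, 2, and 3 respectively.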



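We express the linear systems describing the coarse rewards above in matrix-vector form for convenience. Fix a destination bottleneck $s' \in \partial c$. Consider $P_{h_{s'}}$, $\tilde{P}_{h_{s'}}$ as tensors, and partition $P_{h_{s'}}$ into the pieces $(P^{c}_{h_{s'}})_{s,a,k} = P_{h_{s'}}(s, a, k)$ for $s, k \in \mathring{c} \cap c_{s'}$, $a \in A$ and $(P^{\partial}_{h_{s'}})_{s,a,k} = P_{h_{s'}}(s, a, k)$ for $s \in \mathring{c} \cap c_{s'}$, $k = s'$, $a \in A$. Next, partition $\tilde{P}_{h_{s'}}$ into the pieces $(\tilde{P}^{c}_{h_{s'}})_{s,a,k} = \tilde{P}_{h_{s'}}(s, a, k)$ for $(s, s') \in \operatorname{supp}(\tilde{P})$, $k \in \mathring{c} \cap c_{s'}$, $a \in A$ and $(\tilde{P}^{\partial}_{h_{s'}})_{s,a,k} = \tilde{P}_{h_{s'}}(s, a, k)$ for $(s, s') \in \operatorname{supp}(\tilde{P})$, $k = s'$, $a \in A$. Similarly, partition $\Gamma$, $R$ into pieces $\Gamma^{c}, \Gamma^{\partial}, \tilde{\Gamma}^{c}, \tilde{\Gamma}^{\partial}$ and $R^{c}, R^{\partial}, \tilde{R}^{c}, \tilde{R}^{\partial}$ corresponding to the respective pieces of $P_{h_{s'}}$, $\tilde{P}_{h_{s'}}$ mentioning the same state/action triples $(s, a, s'')$. Equations (10a) and (10b) may be written, respectively, as

$$\big( I - (P^{c}_{h_{s'}} \circ \Gamma^{c}) \big) H_{s'} = \big[ (P^{c}_{h_{s'}} \circ R^{c}) \;\; (P^{\partial}_{h_{s'}} \circ R^{\partial}) \big] \mathbf{1}$$

$$\tilde{H}_{s'} = (\tilde{P}^{c}_{h_{s'}} \circ \tilde{\Gamma}^{c}) H_{s'} + \big[ (\tilde{P}^{c}_{h_{s'}} \circ \tilde{R}^{c}) \;\; (\tilde{P}^{\partial}_{h_{s'}} \circ \tilde{R}^{\partial}) \big] \mathbf{1},$$

where $\circ$ denotes the entrywise product, the action index is summed out when a tensor acts on a vector, $H_{s'}$ collects the expectations $E_s[R_0^{T_0} \mid X_{T_0} = s']$ for $s \in \mathring{c} \cap c_{s'}$, and $\tilde{H}_{s'}$ collects the expectations $E_s[R_0^{T_1} \mid X_{T_1} = s']$ for $(s, s') \in \operatorname{supp}(\tilde{P})$.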

Two final observations of practical interest are in order. The solution of the linear systems defined above (one for each destination bottleneck $s' \in \partial c$) can potentially be carried out efficiently by preconditioning some systems on the basis of solutions to the others. In particular, if a particular bottleneck $s'$ defining a system is close to another in the statespace graph in terms of the diffusion distance induced by $P_c^{\pi_c}$, then there is good reason to believe that the solutions will be close. Second, calculation of the rewards above is closely related computationally to calculation of the coarse discount factors. The next section derives the discount factors, and discusses when one set of quantities can be obtained from the other at essentially no cost.
B.3 Compressed discount factors $\tilde{\Gamma}$: Proof of Proposition 4

Proof of Proposition 4. The approach is similar to that of the rewards. First note that if we set $R(s, a, s') = 1$ uniformly for all $s, s' \in c$, $a \in A$, then $\Delta_T^{T'} = R_T^{T'+1} - R_T^{T'}$ (and $T' + 1$ is clearly a stopping time, so this quantity is well-defined). If $s \in \mathring{c} \cap c_{s'}$, $s' \in \partial c$, $s'' \in c_{s'}$, $a \in A$, then invoking Lemma 6,

$$\begin{aligned}
E_s[\Delta_0^{T_0} \mid X_{T_0} = s', X_1 = s'', a_1 = a] &= \Gamma(s, a, s'')\, E_s[\Delta_1^{T_0} \mid X_{T_0} = s', X_1 = s''] \\
&= \Gamma(s, a, s'')\, E_s[R_1^{T_0+1} - R_1^{T_0} \mid X_{T_0} = s', X_1 = s''] \\
&= \Gamma(s, a, s'')\, E_{s''}[R_0^{T_0+1} - R_0^{T_0} \mid X_{T_0} = s'] \\
&= \Gamma(s, a, s'')\, E_{s''}[\Delta_0^{T_0} \mid X_{T_0} = s'],
\end{aligned}$$

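and $E_s[\Delta_0^{T_0} \mid X_{T_0} = s', X_1 = s'', a_1 = a] = \Gamma(s, a, s')$ if $s'' = s'$. Thus to obtain the expectations $E_s[\Delta_0^{T_0} \mid X_{T_0} = s']$, $s \in \mathring{c} \cap c_{s'}$, we may solve the linear system defined by

$$E_s[\Delta_0^{T_0} \mid X_{T_0} = s'] = \sum_{s'' \in c_{s'},\, a \in A} P_{h_{s'}}(s, a, s'')\, E_s[\Delta_0^{T_0} \mid X_{T_0} = s', X_1 = s'', a_1 = a],$$

where $P_{h_{s'}}$ is given by Equation (28) and depends on the choice of $s'$. Next, if $s, s' \in \partial c$, $s'' \in \mathring{c} \cap c_{s'}$, by the strong Markov property,

$$\begin{aligned}
E_s[\Delta_0^{T_1} \mid X_{T_1} = s', X_{T_0+1} = s'', a_{T_0+1} = a] &= \Gamma(s, a, s'')\, E[\Delta_1^{T_1} \mid X_{T_1} = s', X_1 = s'', T_0 = 0] \\
&= \Gamma(s, a, s'')\, E[\Delta_1^{T_0} \mid X_{T_0} = s', X_1 = s'', T_0 \ge 1] \\
&= \Gamma(s, a, s'')\, E_{s''}[\Delta_0^{T_0} \mid X_{T_0} = s'],
\end{aligned}$$

where the third equality follows from the second applying Lemma 6 with $R \equiv 1$, as above. If $s'' = s'$, then we simply have $E_s[\Delta_0^{T_1} \mid X_{T_1} = s', X_1 = s'', a_1 = a] = \Gamma(s, a, s')$.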

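With these facts in hand, the compressed discount factors may be found from the expectations $\{E_s[\Delta_0^{T_0} \mid X_{T_0} = s'],\ s \in \mathring{c} \cap c_{s'}\}$ computed above, as

$$E_s[\Delta_0^{T_1} \mid X_{T_1} = s'] = \sum_{s'' \in c_{s'},\, a \in A} \tilde{P}_{h_{s'}}(s, a, s'')\, E_s[\Delta_0^{T_1} \mid X_{T_1} = s', X_{T_0+1} = s'', a_{T_0+1} = a]$$

for $(s, s') \in \operatorname{supp}(\tilde{P})$, where $\tilde{P}_{h_{s'}}$ is defined by Equation (29) and depends on the choice of $s'$.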



We note that for each destination bottleneck $s' \in \partial c$, the linear system appearing in Proposition 4 has the same left-hand side as the corresponding linear system for $s'$ in Section 3.2.4, so we can compute the compressed discounts essentially for free if we have previously computed the compressed rewards (or vice versa), provided the resulting solutions to the discount factor equations are non-negative. If they are not non-negative, separate non-negative least squares solutions must be computed, although there may still be helpful preconditioning possibilities.
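The shared left-hand side can be exploited by solving for both right-hand sides in one call. The sketch below is an assumption-laden illustration on a toy chain: a symmetric walk on states 0..4, destination bottleneck 4, unit rewards, and constant discount 0.5. It stacks the reward and discount right-hand sides and flags whether a non-negative least squares fallback would be needed. All variable names are hypothetical; a routine such as SciPy's `scipy.optimize.nnls` could serve as the fallback, which is not exercised here.

```python
import numpy as np

# Toy cluster: symmetric walk on 0..4, destination bottleneck s' = 4.
n, interior, target, gamma = 5, np.array([1, 2, 3]), 4, 0.5
P = np.zeros((n, 1, n))
for s in interior:
    P[s, 0, s - 1] = P[s, 0, s + 1] = 0.5
h = np.arange(n) / (n - 1)                     # hitting probabilities of s' = 4
P_h = np.zeros_like(P)
P_h[interior] = P[interior] * h[None, None, :] / h[interior][:, None, None]
R = np.ones_like(P)                            # unit rewards
Gamma = np.full_like(P, gamma)                 # constant discount factor

PG = (P_h * Gamma)[interior].sum(axis=1)       # action index summed out
lhs = np.eye(len(interior)) - PG[:, interior]  # left-hand side shared by both systems
rhs_reward = (P_h * R)[interior].sum(axis=(1, 2))
rhs_discount = PG[:, target]                   # one-step discount into the bottleneck

# One solve with two right-hand sides: conditional rewards H and discounts D.
H, D = np.linalg.solve(lhs, np.column_stack([rhs_reward, rhs_discount])).T
needs_nnls = bool((D < 0).any())               # fall back to NNLS only if negative
```

For constant discount and unit rewards the two solutions are linked by `H == (1 - D) / (1 - gamma)`, which gives a quick consistency check on the shared system.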